Abstract


This thesis can be viewed as a collection of work in differential privacy
and in clustering. In its first part we discuss work aimed at preserving differential
privacy in a social network, with respect to either the presence or absence
of a single edge [41], or changes to all edges adjacent to one
node [42]. In its second part we discuss multiple clustering problems, focusing
on the $k$-means and $k$-median problems. We show how to correctly cluster
an instance whose optimal $k$-means solution satisfies a certain stability
condition [20], or is resilient to small constant metric perturbations [21], or
has cluster centers that satisfy a particular separation condition [22].

Alternatively, this thesis can be viewed as an investigation of specific non-worst-case
analysis paradigms. The common theme among all results composing
this thesis is that they introduce algorithms whose guarantees are
meaningful only for a subset of inputs - for inputs that satisfy certain nice
properties, or assumptions. These assumptions can be roughly divided into
two types, which we refer to as explicit and implicit. Explicit assumptions give
a very specific and quantifiable characterization of the input (e.g. clustering
instances with the distance between any pair of cluster centers larger than a
specific bound). Implicit assumptions, on the other hand, are harder to characterize.
They pose a certain property that the input should
satisfy, due to some compelling “real-life” reasoning (e.g. justifying a particular
value of $k$ for a $k$-means clustering instance), and often give much
leeway as to the particular structure of the input. In this thesis, we exhibit
multiple examples of assumptions of both kinds, in differential privacy and
clustering, and give algorithms that take advantage of these assumptions. In
particular, we show how tasks that are provably hard become feasible under
suitable assumptions; tasks like providing accurate answers for queries over
a graph while preserving privacy on the node level, or giving a $c$-approximation
for the $k$-median objective for $c < 1 + e^{-1}$.

Part I


Overview 


‘First, I want to give you an overview of what I will tell you over
and over again during the entire presentation.’

Chapter 1
Introduction 


This thesis details new results in differential privacy and in clustering that are obtained
using non-worst-case analysis. Whereas the traditional approach requires us to devise
algorithms that are supposed to work for any instance (algorithms which are tractable even in
the worst case), this thesis takes a complementary approach. In each of the works detailed
in this thesis we assume the given instance has a certain property that makes it “nice”, and
leveraging this assumption we devise tractable algorithms even for problems which are
provably hard in general.

Indeed, one may view this thesis simply as an amalgam of work in differential
privacy and in clustering, and so it is partitioned into a part that deals with differential
privacy (Chapters 5 and 6) and a part that deals with clustering (Chapters 7, 8 and 9).
We prefer, however, to view this thesis as a study of non-worst-case analysis, whose
main theme is to investigate assumptions: what explicit assumptions make
certain algorithms useful, and what characterizes instances that adhere to certain implicit
assumptions, often made in “real-life”.


1.1  Problem  Overview:  Privacy  and  Clustering 

1.1.1  Differential  Privacy 

It is no secret that privacy is a rising concern in today’s world. Databases of enormous
magnitude are kept essentially everywhere, from hospitals that keep the sensitive data of
patients, to network providers that keep track of web-surfers, to government-held
censuses. Naturally, participants in such datasets wish to keep their sensitive data private. In
other words, data curators, with access to participants’ sensitive data, need to guarantee
each participant that the data will be accessed only in a way that will not reveal (much
about) her identity. In a series of works [61, 51, 44, 63, 62], the rigorous notion of
differential privacy was established.

Definition 1.1. A dataset $D$ is a (multi-)set of elements from some predefined domain $X$.
Two datasets $D$ and $D'$ are called neighbors if they differ only on the details of a single
individual. An algorithm ALG is said to preserve $(\epsilon, \delta)$-differential privacy if for any two
neighboring datasets $D$ and $D'$ and any subset $S$ of potential outputs it holds w.p. $\geq 1 - \delta$ that

$$\Pr[\mathrm{ALG}(D) \in S] \leq e^{\epsilon} \Pr[\mathrm{ALG}(D') \in S]$$

By now, there are hundreds of works in differential privacy, detailing an abundance of
privacy-preserving algorithms aimed at a multitude of tasks. Yet the core technique that
the majority of these algorithms apply is the same technique originally proposed by Dwork
et al. [63]. Given a query function $f$, the curator first estimates the global sensitivity of $f$,
denoted $GS(f) = \max |f(D) - f(D')|$, where the maximum is taken over all pairs of
neighboring datasets $D, D'$, then outputs $f(D) + X$ where $X$ is random
noise sampled from a suitable distribution, typically a Laplace or Gaussian, with mean 0
and variance proportional to $(GS(f)/\epsilon)^2$.
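
The noise-addition recipe above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the works cited: the helper names `laplace_noise` and `laplace_mechanism`, the toy counting query, and the choice of Laplace noise with scale $GS(f)/\epsilon$ are all introduced here for the example.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng):
    """Return f(D) + X with X ~ Lap(GS(f) / epsilon) (hypothetical helper)."""
    scale = global_sensitivity / epsilon
    return true_answer + laplace_noise(scale, rng)

# Toy counting query: changing one individual's record changes the count
# by at most 1, so we assume GS(f) = 1 for this example.
rng = random.Random(0)
dataset = [1, 0, 1, 1, 0, 1]
true_count = sum(dataset)
noisy_count = laplace_mechanism(true_count, global_sensitivity=1.0, epsilon=0.5, rng=rng)
```

With scale $GS(f)/\epsilon$ the noise has variance $2(GS(f)/\epsilon)^2$, which matches the "variance proportional to $(GS(f)/\epsilon)^2$" description: a smaller $\epsilon$ (stronger privacy) means a noisier answer.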

1.1.2  Clustering 

Clustering  is  a  “brand  name”  for  a  variety  of  tasks  in  computer  science,  which  roughly 
translates  to  finding  a  good  partition  of  a  given  collection  of  n  datapoints.  In  this  thesis  we 
take  the  machine  learning  view  of  clustering  -  unbeknownst  to  us,  there  exists  some  true 
target  partition,  and  the  partition  our  algorithm  retrieves  should  be  as  close  as  possible  to 
this target clustering. In general, we assume the target partition has k clusters, but no
datapoint is labeled. Instead, we are given access to a metric - the distances between any two
datapoints,  and  we  aim  to  retrieve  a  good  partition  of  the  dataset  using  the  given  metric. 
As  such,  we  must  assume  a  connection  between  the  metric  and  the  target  clustering. 

Indeed, multiple such connections were proposed. A common paradigm in clustering
is to impose a quantitative objective, such as $k$-median or $k$-means, with the assumption
that the target clustering coincides with (or is close to) the $k$-partition that minimizes this
objective. Unfortunately, minimizing either the $k$-median or the $k$-means objective is NP-hard, and
so much effort has been put into approximating these objectives [90, 98]. A separate
paradigm investigates the notion that the distances reflect (dis)similarities between
datapoints. So roughly speaking, based on the notion that points in different clusters have
“large” distances and points from the same cluster have “small” distances, one should
be able to retrieve a good partition of the data [27]. The last paradigm this thesis
discusses deals with points in Euclidean space, where one assumes a separation condition
between the centers of respective clusters. This often reflects the case of a mixture model,
where each cluster is characterized by a certain distribution, and the cluster points are
i.i.d. samples from this distribution. In this model, the expected distance between two
datapoints from the same cluster depends on the variance of the distribution. In contrast, the distance between
two datapoints from different clusters is a function of the variances of the two
distributions and also the distance between their respective centers. Therefore, numerous results
(e.g. [56, 13, 138, 4]) are based on the assumption that the distance between the centers is
sufficiently large - a fact which, combined with the i.i.d. assumption and the specific nature
of the distribution, allows us to correctly cluster all (or most) points. Further details can
be found in Chapter 4.


1.2  Results  Overview:  Privacy  and  Clustering 

1.2.1  Differential  Privacy  in  Social  Networks 

We study the notion of differential privacy in social networks. In the simplest of models,
a social network is no more than a graph, and one of the most common types of queries
regarding a graph is cut-queries: given a set of vertices $S$, how many edges have one
endpoint in $S$ and another in its complement $\bar{S}$? In Chapter 5 we give an algorithm that answers
cut-queries while preserving differential privacy w.r.t. edge changes. Indeed, for a specific cut
$(S, \bar{S})$, the presence or absence of a single edge changes the value of the cut by no more
than 1, so the classical technique of adding a small random noise allows us to answer
a single cut-query fairly accurately (while preserving differential privacy). The problem
lies in answering multiple, and perhaps even all, cut-queries. Our technique essentially
publishes a perturbed Laplacian of the graph, which w.h.p. gives fairly accurate answers to
many cut-queries simultaneously. This result is based on the work in [41].

In addition, we also study the notion of differential privacy w.r.t. vertex changes, i.e. a
vertex can change its own attributes or the set of edges adjacent to it. Here, the problem
lies in answering even a single query, as most queries have very high global sensitivity
(as large as $n$). In Chapter 6 we propose the notion of restricted sensitivity, where we act
as though the set of all possible inputs is restricted to a certain set, specified by the user.
Our technique answers the query in a way that always preserves differential privacy, and
should the network belong to the user-specified subset of potential networks then our answer
is also fairly accurate - its noise is only proportional to the query’s restricted sensitivity
rather than its global sensitivity. This result is based on the work in [42].


1.2.2  Clustering 

In Chapter 7 we consider finite metric $k$-median instances and Euclidean $k$-means
instances, and we improve on the notion of stability of Ostrovsky et al. [121]. Ostrovsky et
al. study instances in which the ratio between the cost of the optimal $(k-1)$-means
solution and the cost of the optimal $k$-means solution is at least $\max\{100, 1/\epsilon^2\}$, and give a
$(1 + O(\epsilon))$-approximation to the $k$-means optimum for such instances. Our work
decouples the strength of the assumption from the quality of the result: we show that under the
assumption that the above-mentioned ratio is only $1 + \alpha$, one can get a $(1 + \epsilon)$-approximation
to the $k$-means optimum, in time $\mathrm{poly}(n, k) \exp(1/\alpha, 1/\epsilon)$. This also holds for instances
satisfying the same property w.r.t. the $k$-median objective. We also build on the work of
Balcan et al. [25], who investigate the connection between point-wise approximations of the
target clustering and a $c$-approximation for the $k$-median or $k$-means problem. In fact, our
work unifies the Ostrovsky et al. assumption and the Balcan et al. assumption provided all
clusters have sufficiently many datapoints. This result is based on the work in [20].

In Chapter 8 we consider clustering instances for which the optimal solution is optimal
under the given metric, and also under any bounded multiplicative perturbation of the given
distances. This model, proposed initially by Bilu and Linial [40], is motivated by the fact
that in practice, distances between data points are typically the result of some heuristic
measure (e.g., edit-distance between strings or Euclidean distance in some feature space)
rather than true semantic distance between objects. Thus, the optimal solution should be
correct or nearly so on small perturbations of the given distances as well. Bilu and Linial
discuss Max-Cut instances satisfying this property up to a fairly large multiplicative error
of $O(\sqrt{n})$. We show how to correctly cluster $k$-median and $k$-means instances that are
$O(1)$-perturbation resilient (specifically, 3-perturbation resilient). This result is based on
the work in [21].

In Chapter 9 we improve on the work of Kumar and Kannan [104], which aims to unify
many works in the mixture model. Kumar and Kannan pose a specific deterministic
condition on a dataset, and show how to correctly cluster instances that satisfy this condition.
Then they show that many of the previously studied mixture models, including the Planted
Partition model [110] and the work of Ostrovsky et al. [121], satisfy this condition w.h.p.
Our work improves on their condition and simplifies their analysis. We pose a condition
only on the distances between any two clusters’ centers, and using the simple triangle and
Markov inequalities, we show how to correctly cluster most of the datapoints. This result
is based on the work in [22].

1.3 Explicit vs Implicit Assumptions


As  mentioned  above,  an  alternative  view  of  this  thesis  is  the  study  of  explicit  and  implicit 
assumptions as a specific case of non-worst-case analysis. The works which compose
this  thesis  do  not  give  an  algorithm  that  works  for  any  graph  or  any  clustering  instance. 
Instead,  in  each  chapter  of  this  thesis  we  assume  that  the  input  satisfies  some  property 
which  allows  us  to  propose  an  algorithm  that  returns  a  meaningful  output. 

In this thesis we differentiate between two types of such assumptions: explicit and implicit
assumptions. Of course, the distinction between the two types isn’t formal or rigorous, but
rather a helpful way of thinking about the different nature of assumptions. Explicit
assumptions give an exact and quantifiable description of the input. They are best illustrated
by examples: datapoints reside in a Euclidean space, points are sampled i.i.d. from a
Gaussian, or a matrix has singular values greater than a specific threshold. In addition,
explicit assumptions are often checkable - one can check the singular values of the given
matrix, or find (approximate) cluster centers and see whether they are sufficiently far apart.

Implicit assumptions are different. They do not discuss the exact nature of the input,
but rather propose a property that the input should satisfy. Such a property is often related
to some “real-life” or outside meaning, typically how the data was collected or how the
solution will be used. Such properties are best illustrated in relation to clustering. The
work of Ostrovsky et al. [121] assumes a lower bound on the ratio between the cost of the optimal
$(k-1)$-means solution and the cost of the optimal $k$-means solution, using the justification that “if
a near-optimal $k$-clustering can be achieved by a partition into fewer than $k$ clusters, then
that smaller value of $k$ should be used to cluster the data”. The work of Balcan et al. [25]
assumes that applying a $c$-approximation for the $k$-means objective (for example) yields a
clustering whose labeling errs on no more than a $\delta$-fraction of the datapoints, where the
error is measured w.r.t. some (unknown and predefined) target clustering, a property which
they call $(c, \delta)$-stability. Verifying whether such properties hold might not be tractable.

In this thesis, the works [41, 22] deal with explicit assumptions: in [41] we assume
a lower bound on the singular values of the input, and in [22] we assume an exact
separation bound between cluster centers. In contrast, the works [20, 21] deal with implicit
assumptions: in [20] we unify both the assumption of Ostrovsky et al. [121] and of Balcan
et al. [25], by a property we call weak-deletion stability, and in [21] we examine clustering
instances that are perturbation resilient, like in Bilu and Linial [40]. The work of [42] lies
somewhere in the middle, between explicit and implicit: we allow the querier to specify
any property she believes the input satisfies, thus converting the querier’s implicit
assumption to an explicit one for our algorithm. A schematic view of the division of works in this
thesis by subject, as well as by whether they are based on implicit or explicit assumptions,
is given in Figure 1.1.

Figure 1.1: Which works in this thesis deal with privacy and which with clustering; which are
based on explicit assumptions and which are based on implicit assumptions.

Chapter 2

Notation  and  Technical  Background 


Disclaimer. This chapter details important notations and classic results in linear
algebra and probability that we use throughout this thesis. This chapter, however, does
not contain any specific background regarding any of the themes of this thesis: privacy,
clustering and the various cases of non-worst-case analysis. We therefore advise the
reader to skip this chapter and return to it upon stumbling onto an unfamiliar notation
or theorem. However, for clarity, and in order to avoid the use of ill-defined notation,
we start with the technical background before moving into the more specific background
detailed in Chapters 3 and 4.


2.1  Graphs 

A graph is a pair of sets $G = (V, E)$ where $V$ is a set of nodes, and $E \subset \binom{V}{2}$ is a set
of edges. We denote $n = |V|$ and say that the size of $G$ is $n$. On occasion we discuss
weighted graphs for which there exists a weight function $w : E \to \mathbb{R}_{\geq 0}$. In the integral
case, we identify a non-edge with an edge of weight zero, and an edge with an edge of weight 1.

The adjacency matrix of a graph is a symmetric matrix in $\mathbb{R}^{n \times n}$ with its $(i, j)$-entry
equal to $w(i, j)$ (and diagonal entries of 0). Given an ordering of the graph’s nodes, we
define the edge matrix of a graph $G$ as a matrix $E_G \in \mathbb{R}^{\binom{n}{2} \times n}$ where for every pair of
distinct vertices $a < b$ there exists a row in which the only non-zero coordinates are the
coordinates of $a$ and $b$, and we have

$$(E_G)_{(a,b),a} = \sqrt{w(a,b)}, \qquad (E_G)_{(a,b),b} = -\sqrt{w(a,b)}$$

We define the (unnormalized) Laplacian of a graph $G$ as $L_G = E_G^T E_G$. A simple
calculation gives that for every node $a$ we have the corresponding diagonal entry
$(L_G)_{a,a} = \sum_{b \neq a} w(a, b)$, and for every two distinct nodes $a, b$ we have $(L_G)_{a,b} = -w(a, b)$. Given a
nonempty strict subset of the nodes $S \subsetneq V$,¹ we call the partition $(S, \bar{S})$ of $V$ an $S$-cut,
and we say its size is $\Phi_G(S) = \sum_{a \in S, b \notin S} w(a, b)$. Given $S \subset V$ we denote its indicator
vector as $\mathbf{1}_S \in \{0, 1\}^n$ where for every $i$ the coordinate $(\mathbf{1}_S)_i = 1$ iff $i \in S$. A simple
calculation shows that for every non-empty $S \subset V$ it holds that

$$\Phi_G(S) = \|E_G \mathbf{1}_S\|^2 = \mathbf{1}_S^T L_G \mathbf{1}_S$$
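
The identity between the cut size and the quadratic form of the Laplacian can be checked on a small graph. The sketch below is an illustration only: the function names, the dictionary representation of the weights (keys are pairs `(a, b)` with `a < b`), and the 4-cycle example are assumptions made for this example, and the Laplacian is built explicitly as the product of the edge matrix with its transpose.

```python
from itertools import combinations
from math import sqrt, isclose

def edge_matrix(n, w):
    """One row per pair a < b, with sqrt(w) at column a and -sqrt(w) at column b."""
    rows = []
    for a, b in combinations(range(n), 2):
        row = [0.0] * n
        weight = w.get((a, b), 0.0)  # a non-edge is an edge of weight zero
        row[a] = sqrt(weight)
        row[b] = -sqrt(weight)
        rows.append(row)
    return rows

def laplacian(n, w):
    """Unnormalized Laplacian, computed entry by entry as E^T E."""
    E = edge_matrix(n, w)
    return [[sum(r[i] * r[j] for r in E) for j in range(n)] for i in range(n)]

def cut_size_direct(w, S):
    """Total weight of edges with exactly one endpoint in S."""
    return sum(weight for (a, b), weight in w.items() if (a in S) != (b in S))

def cut_size_quadratic(n, w, S):
    """The same quantity, computed as 1_S^T L 1_S."""
    L = laplacian(n, w)
    ind = [1.0 if i in S else 0.0 for i in range(n)]
    return sum(ind[i] * L[i][j] * ind[j] for i in range(n) for j in range(n))

# A 4-cycle 0-1-2-3-0 with unit weights.
weights = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (0, 3): 1.0}
S = {0, 1}
direct = cut_size_direct(weights, S)
quadratic = cut_size_quadratic(4, weights, S)
```

For this graph both computations count the two edges (1, 2) and (0, 3) that cross the cut.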

2.2  Linear  Algebra 

2.2.1  SVD 

Given an $m \times n$ matrix $M$, its Singular Value Decomposition (SVD) is $M = U \Sigma V^T$ where
$U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are unitary matrices, and $\Sigma$ has non-zero values only on
its main diagonal. Furthermore, there are exactly $\mathrm{rank}(M)$ positive values on the main
diagonal, denoted $\sigma_1(M) \geq \ldots \geq \sigma_{\mathrm{rank}(M)}(M)$, called the singular values. This allows
us to write $M$ as the sum of $\mathrm{rank}(M)$ rank-1 matrices: $M = \sum_{i=1}^{\mathrm{rank}(M)} \sigma_i u_i v_i^T$. Because
$\Sigma$ has non-zero values only on its main diagonal, the notation $\Sigma^{c}$ denotes a matrix whose
non-zero values lie only on the main diagonal and are $\sigma_1^{c}(M), \ldots, \sigma_{\mathrm{rank}(M)}^{c}(M)$.

Using the SVD, it is clear that if $M$ is of full rank, then $M^{-1} = V \Sigma^{-1} U^T$, and that if
$n = m = \mathrm{rank}(M)$ then $|\det(M)| = \prod_{i=1}^{n} \sigma_i(M)$. Furthermore, even when $M$ is not
full-rank, the SVD allows us to use similar notation to denote the generalizations of the
inverse and of the determinant: the Moore-Penrose inverse of $M$ is $M^{\dagger} = V \Sigma^{-1} U^T$; and
the pseudo-determinant of $M$ is $\widetilde{\det}(M) = \prod_{i=1}^{\mathrm{rank}(M)} \sigma_i(M)$.
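
As a small worked example of these two generalizations (an illustration added here, writing $M^{\dagger}$ for the Moore-Penrose inverse and $\widetilde{\det}$ for the pseudo-determinant), take a rank-1 diagonal matrix whose only positive singular value is 2. Its determinant vanishes, while its pseudo-determinant and Moore-Penrose inverse only see the non-zero singular value:

```latex
M = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
M^{\dagger} = \begin{pmatrix} 1/2 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
\widetilde{\det}(M) = \sigma_1(M) = 2, \qquad
\det(M) = 0, \qquad
M M^{\dagger} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.
```

The last product is the projection onto the orthogonal complement of the kernel of $M$, rather than the identity.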

2.2.2  Positive  Semidefinite  Matrices 

An $n \times n$ symmetric matrix $M$ is called positive semidefinite (PSD) if it holds that $x^T M x \geq 0$
for every $x \in \mathbb{R}^n$. Given two PSDs $M$ and $N$ we denote the fact that $(N - M)$ is PSD by
$M \preceq N$. For further details regarding the many properties of PSD matrices, see [86].

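Fact 2.1. Let $A$ and $B$ be two PSD matrices s.t. $\mathrm{Ker}(A) = \mathrm{Ker}(B)$. If for every $x$ we
have that $x^T A x \leq x^T B x$, then for every $x$ it holds that $x^T A^{\dagger} x \geq x^T B^{\dagger} x$. Symbolically,
$A \preceq B \Rightarrow B^{\dagger} \preceq A^{\dagger}$.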

¹Throughout this thesis, we use $A \subset B$ to denote “$B$ contains $A$”, and $A \subsetneq B$ to denote “$B$ strictly
contains $A$”.

Proof. The proof starts with the following claim. Let $M$ be a positive semidefinite matrix.
If $x^T M x \geq x^T x$ for every $x \in (\mathrm{Ker}(M))^{\perp}$ then it also holds that $x^T M^{\dagger} x \leq x^T x$ for every
$x \in (\mathrm{Ker}(M))^{\perp}$.

Denote the SVD of $M$ as $M = V \Sigma^2 V^T = \sum_{i=1}^{r} \sigma_i^2 v_i v_i^T$, where $v_i$ is the $i$-th column of $V$.
Fix $x \in (\mathrm{Ker}(M))^{\perp}$ and observe that $x$ is spanned by the same $r$ vectors $\{v_1, v_2, \ldots, v_r\}$, so
we can write $x = \sum_{i=1}^{r} a_i v_i$. Denote $y = V \Sigma^{-1} V^T x = \sum_{i=1}^{r} \sigma_i^{-1} v_i v_i^T x$. We have that
$y = \sum_{i=1}^{r} a_i \sigma_i^{-1} v_i$ so $y \in (\mathrm{Ker}(M))^{\perp}$. Therefore $y^T M y \geq y^T y$, but $y^T y = x^T M^{\dagger} x$ and
$y^T M y = x^T x$.

Having established the claim, we can proceed with the proof. We denote the SVD of $A$
as $A = V \Sigma^2 V^T$ and also $B = W \Pi^2 W^T$. Because we can split any vector $x$ into the direct
sum $x = x_0 + x_{\perp}$ where $x_0 \in \mathrm{Ker}(A) = \mathrm{Ker}(B)$ and $x_{\perp} \in (\mathrm{Ker}(A))^{\perp}$, and since
the required inequality holds trivially for $x_0$, we need to show it holds for
$x_{\perp}$. Given any $z \in (\mathrm{Ker}(A))^{\perp}$, set $y = V \Sigma^{-1} V^T z$. We know that $y^T A y \leq y^T B y$, and
therefore

$$z^T z = y^T A y \leq y^T B y = z^T \left( V \Sigma^{-1} V^T W \Pi^2 W^T V \Sigma^{-1} V^T \right) z \stackrel{\mathrm{def}}{=} z^T C z$$

The above proves that $C$ is a positive semidefinite matrix whose kernel is exactly $\mathrm{Ker}(A) = \mathrm{Ker}(B)$,
and so it follows from the previous claim that $z^T z \geq z^T C^{\dagger} z$. Let $I_{\mathrm{Ker}(C)^{\perp}}$
be the matrix which nullifies every element in $\mathrm{Ker}(C)$, yet operates like the identity on
$(\mathrm{Ker}(C))^{\perp}$. One can easily check that $C^{\dagger} = V \Sigma V^T W \Pi^{-2} W^T V \Sigma V^T$ by verifying that
indeed $C^{\dagger} C = C C^{\dagger} = I_{\mathrm{Ker}(C)^{\perp}}$. So now, given $x$ we denote $z = V \Sigma^{-1} V^T x$ and apply
the above to deduce $x^T B^{\dagger} x = z^T C^{\dagger} z \leq z^T z = x^T A^{\dagger} x$. □


2.2.3 Weyl’s and Lidskii’s Inequalities

We  describe  below  two  useful  inequalities  regarding  the  eigenvalues  and  singular  values 
of a perturbation of a matrix. That is, we relate the eigenvalues of a given matrix A with
the  eigenvalues  of  a  matrix  B  =  A  +  E. 

Lemma 2.2 (Weyl’s Inequality). Let $A$ and $B$ be positive semidefinite matrices s.t. the
matrix $E = B - A$ satisfies $x^T E x \geq 0$ for every $x$. Then for every $1 \leq i \leq n$ it holds that
the $i$-th singular values satisfy:

$$sv_i(A) \leq sv_i(B)$$

Weyl’s inequality is a corollary of the max-min characterization of the singular values
of a matrix.

Theorem 2.3 (Courant-Fischer Min-Max Principle). For every symmetric positive semidefinite
matrix $A$ and every $1 \leq i \leq n$, the $i$-th singular value of $A$ satisfies:

$$sv_i(A) = \max_{S:\ \dim(S) = i} \; \min_{x \in S:\ \|x\| = 1} \langle Ax, x \rangle$$


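Proof of Lemma 2.2. Weyl’s inequality is a direct application of the theorem. Let $S_A$ be
the $i$-dimensional subspace s.t. $sv_i(A) = \min_{x \in S_A:\ \|x\| = 1} \langle Ax, x \rangle$. For every $x \in S_A$ we
have that

$$\langle Ax, x \rangle = \langle Ax, x \rangle + 0 \leq \langle Ax, x \rangle + \langle Ex, x \rangle = \langle Bx, x \rangle$$

so $sv_i(A) \leq \min_{x \in S_A:\ \|x\| = 1} \langle Bx, x \rangle$. Thus $sv_i(B) = \max_{S} \min_{x \in S:\ \|x\| = 1} \langle Bx, x \rangle \geq sv_i(A)$. □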

Lemma 2.4 (Lidskii’s inequality). Let $A$ and $B$ be $n \times n$ symmetric matrices. Denote
$E = B - A$. Then for every $k$ and every $k$ indices $1 \leq i_1 < i_2 < \cdots < i_k \leq n$ we have
that

$$\sum_{j=1}^{k} ev_{i_j}(B) \leq \sum_{j=1}^{k} ev_{i_j}(A) + \sum_{i=1}^{k} ev_i(E)$$


Much like Weyl’s inequality (Lemma 2.2) follows from the Courant-Fischer Min-Max
principle, Lidskii’s inequality follows from a generalization of the Courant-Fischer
principle.

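Theorem 2.5 (Wielandt’s Min-Max Principle). Let $A$ be an $n \times n$ symmetric matrix. Then
for every $k$ and every $k$ indices $1 \leq i_1 < i_2 < \cdots < i_k \leq n$ we have that

$$\sum_{j=1}^{k} ev_{i_j}(A) = \max_{\substack{S_1 \subset S_2 \subset \cdots \subset S_k \\ \dim(S_j) = i_j}} \; \min_{x_j \in S_j:\ x_j \text{ orthonormal}} \sum_{j=1}^{k} \langle A x_j, x_j \rangle$$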


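Proof of Lemma 2.4. To prove Lidskii’s inequality, fix $i_1 < i_2 < \cdots < i_k$ and let
$T_1, T_2, \ldots, T_k$ be the subspaces for which

$$\sum_{j=1}^{k} ev_{i_j}(B) = \min_{x_j \in T_j:\ x_j \text{ orthonormal}} \sum_{j=1}^{k} \langle B x_j, x_j \rangle$$

For every orthonormal $v_1, v_2, \ldots, v_k$ we have that
$\sum_{j=1}^{k} \langle B v_j, v_j \rangle = \sum_{j=1}^{k} \langle A v_j, v_j \rangle + \sum_{j=1}^{k} \langle E v_j, v_j \rangle \leq \sum_{j=1}^{k} \langle A v_j, v_j \rangle + \sum_{i=1}^{k} ev_i(E)$, so

$$\sum_{j=1}^{k} ev_{i_j}(B) \leq \min_{x_j \in T_j:\ x_j \text{ orthonormal}} \sum_{j=1}^{k} \langle A x_j, x_j \rangle + \sum_{i=1}^{k} ev_i(E)$$

and clearly

$$\sum_{j=1}^{k} ev_{i_j}(A) = \max_{\substack{S_1 \subset S_2 \subset \cdots \subset S_k \\ \dim(S_j) = i_j}} \; \min_{x_j \in S_j:\ x_j \text{ orthonormal}} \sum_{j=1}^{k} \langle A x_j, x_j \rangle \geq \min_{x_j \in T_j:\ x_j \text{ orthonormal}} \sum_{j=1}^{k} \langle A x_j, x_j \rangle$$

□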


In fact, Lidskii’s inequality holds also for singular values, and not just eigenvalues.
This follows from Lemma 2.4, and from the following observation of Wielandt. Given
an $m \times n$ matrix $M$, the matrix $N = \begin{pmatrix} 0 & M \\ M^T & 0 \end{pmatrix}$ is symmetric and has eigenvalues which
are (in descending order):

$$\{sv_1(M), sv_2(M), \ldots, sv_m(M), 0, 0, \ldots, 0, -sv_m(M), -sv_{m-1}(M), \ldots, -sv_1(M)\}$$


2.3  Probability 

Throughout  this  thesis,  we  make  multiple  uses  of  randomness,  analysis  of  random  vari¬ 
ables,  and  multiple  concentration  bounds.  We  detail  here  their  definition  and  standard 
inequalities  regarding  the  sum  of  multiple  independent  random  variables. 


2.3.1  Types  of  Random  Variables 

2.3.1.1 Bernoulli Random Variables

Definition 2.6. A random variable $X$ is said to be a Bernoulli random variable if it takes
the value 1 w.p. $p$ and the value 0 w.p. $1 - p$.


The mean of a Bernoulli random variable is $p$ and its variance is $p(1 - p)$. Naturally,
the mean of the sum of $n$ independent Bernoulli random variables with mean $p$ is $np$, and
the variance of the sum is $np(1 - p)$. This means that we expect the sum to belong to the
range $np \pm \sqrt{np(1 - p)}$.

2.3.1.2 Laplace Random Variables


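Definition 2.7. A random variable $X \sim \mathrm{Lap}(\mu, \sigma)$ is said to be a Laplace random variable
if its PDF is

$$\mathrm{PDF}_X(x) = \frac{1}{2\sigma} e^{-\frac{|x - \mu|}{\sigma}}$$

The mean of $X$ is $\mu$ and its variance is $2\sigma^2$. We therefore expect that $X \in \mu \pm \sqrt{2}\sigma$.
In fact, standard tail bounds give that for any $1 > \delta > 0$ w.p. $\geq 1 - \delta$ we have
$|x - \mu| \leq \sigma \log(1/\delta)$. Indeed,

$$\Pr[|x - \mu| > \sigma \log(1/\delta)] = 2 \int_{y > \sigma \log(1/\delta)} \frac{1}{2\sigma} e^{-y/\sigma} \, dy = \left( -e^{-y/\sigma} \right) \Big|_{\sigma \log(1/\delta)}^{\infty} = \delta$$

In this thesis, we denote $X \sim \mathrm{Lap}(\sigma)$ in case $X \sim \mathrm{Lap}(0, \sigma)$.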
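
The tail computation above can be checked numerically. The following sketch is illustrative only: the function names and the parameter values are assumptions made for this example, and the sampler draws a Laplace variable with mean 0 as the difference of two independent exponentials.

```python
import math
import random

def laplace_tail(t, sigma):
    """Closed form of Pr[|X - mu| > t] for X ~ Lap(mu, sigma) and t >= 0."""
    return math.exp(-t / sigma)

def empirical_tail(t, sigma, trials, rng):
    """Monte Carlo estimate of the same probability, taking mu = 0."""
    rate = 1.0 / sigma
    hits = 0
    for _ in range(trials):
        x = rng.expovariate(rate) - rng.expovariate(rate)  # Lap(0, sigma)
        if abs(x) > t:
            hits += 1
    return hits / trials

sigma, delta = 2.0, 0.05
threshold = sigma * math.log(1.0 / delta)
exact = laplace_tail(threshold, sigma)  # equals delta up to rounding
estimate = empirical_tail(threshold, sigma, 100_000, random.Random(0))
```

The closed form returns exactly $\delta$ at the threshold $\sigma \log(1/\delta)$, and the empirical frequency should land close to it.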


2.3.1.3  Gaussian  Random  Variables 

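Definition 2.8. A univariate random variable $X \sim \mathcal{N}(\mu, \sigma^2)$ is said to be a Gaussian
random variable if its PDF is

$$\mathrm{PDF}_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The mean of $X$ is $\mu$ and its variance is $\sigma^2$. $X$ is said to be a normal Gaussian if
$X \sim \mathcal{N}(0, 1)$. Classic concentration bounds on Gaussians give that
$\Pr[(x - \mu)^2 > 2\log(2/\delta)\sigma^2] \leq \delta$.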

Gaussian random variables abide by the linear combination rule: for any two independent normal
random variables $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, we have that their linear
combination $Z = aX + bY$ is distributed according to $Z \sim \mathcal{N}(a\mu_X + b\mu_Y, a^2\sigma_X^2 + b^2\sigma_Y^2)$.
This in turn allows us to identify a random variable $R \sim \mathcal{N}(0, \sigma^2)$ with the random
variable $\sigma R'$, where $R' \sim \mathcal{N}(0, 1)$.

The multivariate normal distribution is the multi-dimensional extension of the univariate
normal distribution. $X \sim \mathcal{N}(\mu, \Sigma)$ denotes an $m$-dimensional multivariate r.v. whose mean
is $\mu \in \mathbb{R}^m$, and whose variance is the PSD matrix $\Sigma = \mathbf{E}[(X - \mu)(X - \mu)^T]$. If $\Sigma$ has full rank
($\Sigma$ is positive definite) then
$\mathrm{PDF}_X(x) = \frac{1}{\sqrt{(2\pi)^m \det(\Sigma)}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$, a well defined
function. If $\Sigma$ has a non-trivial kernel space then $\mathrm{PDF}_X$ is technically undefined (since $X$
is defined only on a subspace of volume 0, yet $\int_{\mathbb{R}^m} \mathrm{PDF}_X(x) dx = 1$). However, if we
restrict ourselves only to the subspace $V = (\mathrm{Ker}(\Sigma))^{\perp}$, then $\mathrm{PDF}_X^V$ is defined over $V$ and
$\mathrm{PDF}_X^V(x) = \frac{1}{\sqrt{(2\pi)^{\mathrm{rank}(\Sigma)} \widetilde{\det}(\Sigma)}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{\dagger} (x - \mu)\right)$.
From now on, we omit the superscript from
the PDF and refer to the above function as the PDF of $X$. Observe that using the SVD,
we can denote $\Sigma = U \, \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_r^2, 0, \ldots, 0) \, U^T$, and so $V$ is the subspace spanned
by the first $r$ columns of $U$. The multivariate extension of the linear combination rule is as
follows. If $A$ is an $n \times m$ matrix, then the multivariate r.v. $Y = AX$ is distributed as
$Y \sim \mathcal{N}(A\mu, A \Sigma A^T)$. For further details regarding multivariate Gaussians see [115].


2.3.2  Concentration  Bounds 


We  conclude  this  chapter  by  stating  4  famous  inequalities.  Their  proofs  can  be  found 
in  [12]. 

Theorem 2.9 (Markov’s inequality). Let $X$ be a non-negative random variable. Then for every $t > 0$ we have that

$$\Pr[X > t] \le \frac{E[X]}{t}$$

Theorem 2.10 (Chebyshev’s inequality). Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. Then for every $t > 0$ we have that

$$\Pr[|X - \mu| > t] \le \frac{\sigma^2}{t^2}$$

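Theorem 2.11 (Chernoff-Hoeffding bounds). Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables taking values in the range $[a, b]$ and with mean $\mu$. Then for every $t > 0$ we have

$$\Pr\left[\left|\frac{1}{n}\sum_i X_i - \mu\right| > t\right] \le 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right)$$

and for every $\delta > 0$ we have

$$\Pr\left[\left|\frac{1}{n}\sum_i X_i - \mu\right| > \delta\mu\right] \le 2\exp\left(-\frac{1}{3}n\delta^2\mu\right)$$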

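As a quick numerical illustration of the additive bound in Theorem 2.11, the following Python sketch evaluates the right-hand side and compares it with the empirical deviation frequency of the mean of fair coin flips. The function names and the choice of Bernoulli(1/2) samples are assumptions made purely for illustration.

```python
import math
import random

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Right-hand side of the additive bound: 2 * exp(-2 n t^2 / (b - a)^2), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t / (b - a) ** 2))

def empirical_deviation_rate(n, t, trials=2000, seed=0):
    """Fraction of trials in which the mean of n fair coin flips is more than t away from 1/2."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) > t:
            bad += 1
    return bad / trials

bound = hoeffding_bound(n=100, t=0.1)
observed = empirical_deviation_rate(n=100, t=0.1)
```

The bound is typically loose for a specific distribution: it holds for every distribution supported on $[a, b]$, so the observed frequency should fall well below it.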

Theorem 2.12 (Azuma’s inequality). Let $X_0, X_1, X_2, \ldots, X_n$ be a sequence of random variables satisfying that for every $i \ge 1$ we have $E[X_i \mid X_0, \ldots, X_{i-1}] = X_{i-1}$ and that there exists a $c > 0$ s.t. $\Pr[|X_i - X_{i-1}| \le c] = 1$. Then,

$$\Pr[|X_n - X_0| > t] \le 2\exp\left(-\frac{t^2}{2nc^2}\right)$$

Chapter 3

Background:  Differential  Privacy 


Consider a scenario in which a trusted curator gathers personal information from $n$ individuals, and wishes to release statistics about these individuals to the public without compromising any individual’s privacy. Differential privacy [63] provides a robust guarantee of privacy for such data releases. It guarantees that for any two neighboring databases (databases that differ on the details of any single individual), the curator’s distributions over potential outputs are statistically close.

Definition 3.1 ([62]). An algorithm ALG which maps databases into some range, $\mathrm{ALG} : \mathcal{D} \to \mathcal{R}$, is said to preserve $(\varepsilon, \delta)$-differential privacy if for all pairs $D$ and $D'$ of databases that differ on the details of a single individual, and for all subsets $S \subseteq \mathcal{R}$, it holds that

$$\Pr[\mathrm{ALG}(D) \in S] \le e^{\varepsilon}\Pr[\mathrm{ALG}(D') \in S] + \delta$$

We say ALG preserves $\varepsilon$-differential privacy if it preserves $(\varepsilon, 0)$-differential privacy.

By itself, preserving differential privacy isn’t hard, since the curator’s answers to users’ queries can be so noisy that they obliterate any useful data stored in the database. Therefore, the key research question in this field is to provide tight utility and privacy tradeoffs.


3.1  The  Basic  Mechanism 

The most basic technique that preserves differential privacy and gives good utility guarantees is to add relatively small Laplace or Gaussian noise to a query’s true answer. We use $\mathcal{D}$ to denote the set of all possible datasets, where we view a dataset as a multiset of elements taken from some domain $\mathcal{X}$. We say two datasets $D, D' \in \mathcal{D}$ are neighbors if they differ on the details of a single individual. We denote the fact that $D'$ is a neighbor of $D$ using $D' \sim D$. We define the distance $d(D, D')$ between two databases $D, D' \in \mathcal{D}$ as the minimal non-negative integer $k$ s.t. there exists a path $D_0, D_1, \ldots, D_k$ where $D_0 = D$, $D_k = D'$ and for every $1 \le i \le k$ we have that $D_{i-1} \sim D_i$. Alternatively, we can say that $d(D, D') = k$ if $|D \setminus D'| = k$. Given a subset $V \subseteq \mathcal{D}$ we denote the distance of a database $D$ to $V$ as $d(D, V) = \min_{D' \in V} d(D, D')$.

Definition 3.2 ([63]). The global sensitivity of a query function $f : \mathcal{D} \to \mathbb{R}$ is $GS_f = \max_{D \sim D'} |f(D) - f(D')|$.

For  any  given  statistic  or  query,  its  global  sensitivity  measures  the  maximum  difference 
in  the  answer  to  that  query  over  all  pairs  of  neighboring  data  sets.  More  importantly,  as 
the  next  theorem  illustrates,  global  sensitivity  provides  an  upper  bound  on  the  amount  of 
noise  that  has  to  be  added  to  the  actual  statistic  in  order  to  preserve  differential  privacy. 

Theorem 3.3 ([63, 62]). Given a query function $f : \mathcal{D} \to \mathbb{R}$, the mechanism that outputs $f(D) + X$ is


• $\varepsilon$-differentially private, when $X \sim \mathrm{Lap}(GS(f)/\varepsilon)$

• $(\varepsilon, \delta)$-differentially private, when $X \sim \mathcal{N}(0, GS(f)^2 \cdot 2\log(2/\delta)/\varepsilon^2)$.
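The Laplace variant of the basic mechanism can be sketched in a few lines of Python. This is a minimal illustration rather than a hardened implementation: the helper names, the toy dataset, and the choice of a counting query (whose global sensitivity is 1, since changing one record changes the count by at most 1) are all assumptions made for the example.

```python
import random

def laplace_noise(scale, rng):
    """Sample from Lap(scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng):
    """Release true_answer + Lap(GS / epsilon), as in the first bullet of Theorem 3.3."""
    return true_answer + laplace_noise(global_sensitivity / epsilon, rng)

def count_query(dataset, predicate):
    """Counting query: number of records satisfying the predicate (global sensitivity 1)."""
    return sum(1 for record in dataset if predicate(record))

rng = random.Random(0)
ages = [23, 35, 41, 29, 62, 57, 33, 48]          # hypothetical toy dataset
true_count = count_query(ages, lambda age: age >= 40)
private_count = laplace_mechanism(true_count, global_sensitivity=1.0, epsilon=0.5, rng=rng)
```

The expected magnitude of the added noise equals the scale $GS(f)/\varepsilon$, so halving $\varepsilon$ doubles the typical error.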

The basic mechanism provides good utility and privacy tradeoffs for answering a single query of small global sensitivity. However, this leaves two questions for future research: (i) can we give any reasonable utility guarantees for queries of large global sensitivity, and (ii) how well can we answer multiple queries. We start by addressing the latter issue, and only after surveying multiple mechanisms with better dependence on the number of queries will we return to the first issue.

It is straightforward to see that applying an $(\varepsilon, \delta)$-differentially private mechanism in order to answer $t$ queries preserves $(t\varepsilon, t\delta)$-differential privacy. However, the following theorem gives better bounds (roughly stated as answering $t$ queries while preserving $(\varepsilon \cdot O(\sqrt{t}), \delta \cdot O(t))$-differential privacy).

Theorem 3.4 ([65]). Fix $\varepsilon, \delta, \delta' > 0$. Let $M_1, \ldots, M_t$ be $t$ mechanisms that preserve $(\varepsilon, \delta)$-differential privacy. Then the concatenation of all mechanisms preserves $(\varepsilon^*, \delta^*)$-differential privacy, for

$$\varepsilon^* = \varepsilon \cdot \sqrt{2t\ln(1/\delta')} + t\varepsilon(e^{\varepsilon} - 1), \qquad \delta^* = t\delta + \delta'$$

The proof of Theorem 3.4 uses the KL-divergence. First, Dwork et al show that given two random variables $X, X'$ s.t. for every $x \in \mathrm{Supp}(X)$ it holds $\Pr[X = x] \le e^{\varepsilon}\Pr[X' = x]$, it holds that

$$E_{x \sim X}\left[\ln\frac{\Pr[X = x]}{\Pr[X' = x]}\right] \le \varepsilon(e^{\varepsilon} - 1) \le 2\varepsilon^2$$

Secondly, Dwork et al use Azuma’s inequality to show that for $X_1, X_2, \ldots, X_t$ and $X'_1, X'_2, \ldots, X'_t$ s.t. for every $i$ the same holds for $X_i$ and $X'_i$, we have w.h.p that

$$\sum_i \ln\frac{\Pr[X_i = x_i]}{\Pr[X'_i = x_i]} - E\left[\sum_i \ln\frac{\Pr[X_i = x_i]}{\Pr[X'_i = x_i]}\right] \le \varepsilon\sqrt{2t\ln(1/\delta')}$$

And finally they show that for every $Y, Y'$ satisfying $\Pr[Y = x] \le e^{\varepsilon}\Pr[Y' = x] + \delta$ there exist $X, X'$ such that $\max\{|\Pr[Y = x] - \Pr[X = x]|, |\Pr[Y' = x] - \Pr[X' = x]|\} \le \delta$ and that $\Pr[X = x] \le e^{\varepsilon}\Pr[X' = x]$.
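To get a feel for how much Theorem 3.4 gains over the naive $(t\varepsilon, t\delta)$ bound, the two formulas can be evaluated side by side. The Python sketch below only plugs numbers into the expressions stated above; the function names and the sample parameters are assumptions chosen for illustration.

```python
import math

def basic_composition(eps, delta, t):
    """Naive bound: t mechanisms compose to (t * eps, t * delta)."""
    return t * eps, t * delta

def advanced_composition(eps, delta, t, delta_prime):
    """Bound of Theorem 3.4: eps* = eps * sqrt(2 t ln(1/delta')) + t eps (e^eps - 1)."""
    eps_star = eps * math.sqrt(2.0 * t * math.log(1.0 / delta_prime))
    eps_star += t * eps * (math.exp(eps) - 1.0)
    return eps_star, t * delta + delta_prime

naive_eps, naive_delta = basic_composition(eps=0.01, delta=1e-9, t=10000)
adv_eps, adv_delta = advanced_composition(eps=0.01, delta=1e-9, t=10000, delta_prime=1e-6)
```

For small per-query $\varepsilon$ and many queries the advanced bound is far smaller; for large $\varepsilon$ the second term $t\varepsilon(e^{\varepsilon} - 1)$ dominates and the naive bound can be the better of the two.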


3.2  Alternative  Mechanisms 

As Theorem 3.4 suggests, when answering $t$ queries (all with small global sensitivity) by naively applying the basic mechanism to answer each, in order to preserve $(\varepsilon, \delta)$-differential privacy, we need to add random noise of roughly $O(\sqrt{t}/\varepsilon)$ to each query. It turns out that there exist other scenarios where one can improve the error’s dependency on $t$ to a logarithmic dependency. Below, we detail mechanisms that achieve such logarithmic dependency. As we show, this improvement often comes at the expense of computational efficiency.

3.2.1  The  Exponential  Mechanism 

McSherry and Talwar [112] introduced the exponential mechanism, which preserves $\varepsilon$-differential privacy. In fact, it is somewhat of a general scheme, which can be tweaked based on its application.

Theorem 3.5 ([112]). Let $\mathcal{R}$ be a discrete range of desired outputs, and let $s$ be any scoring function $s : \mathcal{D} \times \mathcal{R} \to \mathbb{R}$. Given a database $D$, let the weight of an element $r \in \mathcal{R}$ be defined as

$$w(r) = \exp\left(-\frac{\varepsilon \cdot s(D, r)}{2\,GS(s)}\right)$$

Then the mechanism that outputs $r$ w.p. $\propto w(r)$ preserves $\varepsilon$-differential privacy.

Proof. Given any neighboring $D$ and $D'$, for every $r$ we have that $e^{-\varepsilon/2} \le w_D(r)/w_{D'}(r) \le e^{\varepsilon/2}$. Therefore, we have that for every $r$

$$\frac{\Pr[r \mid D]}{\Pr[r \mid D']} = \frac{w_D(r)}{w_{D'}(r)} \cdot \frac{\sum_{r' \in \mathcal{R}} w_{D'}(r')}{\sum_{r' \in \mathcal{R}} w_D(r')} \le e^{\varepsilon/2 + \varepsilon/2}$$

□
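The sampling step of the exponential mechanism is short enough to write out. The Python sketch below assumes the convention in which a lower score is better and the weight of $r$ is proportional to $\exp(-\varepsilon \cdot s(D, r) / (2\,GS(s)))$; the helper names and the "most common item" scoring function are hypothetical and serve only as an example.

```python
import math
import random

def output_distribution(dataset, outputs, score, sensitivity, epsilon):
    """Probability of each output, proportional to exp(-epsilon * score / (2 * sensitivity))."""
    scores = [score(dataset, r) for r in outputs]
    best = min(scores)  # shifting all scores by a constant leaves the distribution unchanged
    weights = [math.exp(-epsilon * (s - best) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(dataset, outputs, score, sensitivity, epsilon, rng):
    """Sample one output according to the exponential-mechanism distribution."""
    probs = output_distribution(dataset, outputs, score, sensitivity, epsilon)
    return rng.choices(outputs, weights=probs, k=1)[0]

def miss_count(dataset, candidate):
    """Score of a candidate 'most common item': number of records that differ from it.

    Replacing one record changes this score by at most 1, so its sensitivity is 1."""
    return sum(1 for record in dataset if record != candidate)

rng = random.Random(0)
votes = ["a"] * 90 + ["b"] * 10                 # hypothetical toy dataset
winner = exponential_mechanism(votes, ["a", "b", "c"], miss_count, 1.0, 1.0, rng)
```

Note that the cost of this sketch is linear in the size of the output range, which is exactly why the mechanism becomes intractable when the range is exponentially large.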


McSherry and Talwar applied the exponential mechanism to problems in mechanism design (after pointing out that applying an $\varepsilon$-differentially private mechanism to a mechanism-design problem guarantees $\varepsilon$-truthfulness). Their work was the first to explore the connection between privacy and mechanism design, a rapidly growing field nowadays (see survey by Pai and Roth [122]). But the exponential mechanism allows for much more - using the exponential mechanism, Blum et al [43] have shown that there exists an algorithm (not necessarily efficient) to privately release a sanitized dataset that answers a very large set of queries $Q$ simultaneously.


Theorem 3.6 ([43]). Let $D$ be a dataset whose elements come from a domain $\mathcal{X}$. Let $Q$ be a predefined set of functions $q : \mathcal{X} \to \{0, 1\}$. For every $q \in Q$, let $f_q : \mathcal{D} \to [0, 1]$ be the corresponding fractional counting query $f_q(D) = \frac{1}{|D|}|\{x \in D : q(x) = 1\}|$. Fix

1/2  >  a,j3  >  0,  and  define  m  =  201°^ipQ)  dim vc{Q)- 


Then for any $\varepsilon > 0$, applying the exponential mechanism over the range $\mathcal{R} = \mathcal{X}^m$ of databases of size $m$, using the scoring function $s(D, \hat{D}) = \max_{q \in Q} |f_q(\hat{D}) - f_q(D)|$, outputs a database $\hat{D}$ of size $m$ s.t. with probability $\ge 1 - \beta$ it holds that for every $q \in Q$ we have that $|f_q(\hat{D}) - f_q(D)| \le \alpha$, provided that

$$|D| \ge \frac{4}{\varepsilon\alpha}\left(m\ln(|\mathcal{X}|) + \ln(1/\beta)\right)$$


Proof. Preserving $\varepsilon$-differential privacy is a consequence of the exponential mechanism. To prove utility, observe first that $GS(s) = 1/|D|$. Observe also that standard techniques regarding the VC-dimension guarantee the existence of a dataset $D^*$ for which the score is $\le \alpha/2$. Hence, the ratio of the weight of this one good dataset to the total weight of all bad datasets (for which the score $> \alpha$) can be bounded using the size of $D$ by

$$\frac{\Pr[D^*]}{\Pr[\text{Bad dataset}]} \ge \frac{\exp(-\varepsilon\alpha|D|/4)}{|\mathcal{X}|^m \exp(-\varepsilon\alpha|D|/2)} = \exp\left(\frac{\varepsilon\alpha|D|}{4} - m\log(|\mathcal{X}|)\right) \ge \frac{1}{\beta}$$

□

The result of [43] was the first theoretical work to show the existence of non-interactive privacy preserving mechanisms (though the data-mining community has used the technique of Randomized-Response for privacy, as we discuss below). Such mechanisms have the benefit that they do not need to interact with a querier and can simply release information that will answer any of the querier’s potential questions. However, the obvious problem with the exponential mechanism is its intractability - in most reasonable applications $\mathcal{X}^m$ is not poly-time tractable. In fact, there have been several works [64, 136, 135] proving that for the case of $\mathcal{X} = \{0, 1\}^d$, a sanitized dataset cannot be output in poly-time under certain cryptographic assumptions.


3.2.2  Randomized  Response 


The  simple  technique  of  decentralized  input  perturbation,  or  Randomized-Response  [141], 
gives  a  straight-forward  way  of  preserving  e-differential  privacy. 

Theorem 3.7. Let $D$ be a database whose elements are taken from a finite-size domain $\mathcal{X}$ of size $T$. Given $0 < \varepsilon < 1$, the algorithm that for every entry $x \in D$ (recall, $D$ is a multiset) picks $y(x)$ independently where

$$y(x) = \begin{cases} x, & \text{w.p. } \frac{1+\varepsilon}{T+\varepsilon} \\ x', & \text{w.p. } \frac{1}{T+\varepsilon}, \text{ for any } x' \ne x \end{cases}$$

and publishes $\hat{D} = \{y(x) : x \in D\}$, is an $\varepsilon$-differential privacy preserving mechanism.1 Furthermore, fix $S \subseteq \{1, 2, \ldots, n\}$ and define the multiset $D|_S = \{x_i : i \in S\}$. Then w.p. $\ge 1 - \beta$ we can use $\hat{D}$ to estimate the fraction of elements in $D|_S$ that are of type $x$ up to an additive error of $O\left(\frac{T}{\varepsilon}\sqrt{\frac{\log(T/\beta)}{|S|}}\right)$.


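Proof. The fact that the above-mentioned algorithm preserves $\varepsilon$-differential privacy is straightforward. Given two neighboring datasets $D$ and $D'$ that differ on the $i$-th entry, it is easy to see that for any $y \in \mathcal{X}$ we have

$$\frac{\Pr[y_i = y \mid D]}{\Pr[y_i = y \mid D']} \le 1 + \varepsilon \le e^{\varepsilon}$$

where $y_i$ denotes the $i$-th published entry. Now, fix $S$. For every $x \in \mathcal{X}$, we denote $p_x$ as the fraction of entries in $D|_S$ that are of type $x$. A straightforward calculation gives that in $\hat{D}|_S$ we expect to see a fraction of $\mu_x = \frac{1 + \varepsilon p_x}{T + \varepsilon}$ entries of type $x$. Due to standard Hoeffding and union bounds it holds that w.p. $\ge 1 - \beta$ the number of elements in $\hat{D}|_S$ of type $x$, denoted $n_x$, satisfies $\left|\frac{n_x}{|S|} - \mu_x\right| \le O\left(\sqrt{\frac{\log(T/\beta)}{|S|}}\right)$ for all $x \in \mathcal{X}$. Therefore, using the estimation

$$\hat{p}_x = \frac{1}{\varepsilon}\left((T + \varepsilon)\frac{n_x}{|S|} - 1\right)$$

we have that $|\hat{p}_x - p_x| \le O\left(\frac{T}{\varepsilon}\sqrt{\frac{\log(T/\beta)}{|S|}}\right)$. □

1 A slightly less elegant version picks $y(x) = x$ w.p. $\frac{e^{\varepsilon}}{T - 1 + e^{\varepsilon}}$ and $y(x) = x'$ for any other $x' \ne x$ w.p. $\frac{1}{T - 1 + e^{\varepsilon}}$.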

Observe that Randomized-Response allows us to estimate queries that are based on the histogram induced by $D$ on multiple predefined subsets of entries $S_1, S_2, \ldots, S_t$ (which we could think of as $t$ predefined queries), with the error bound on each estimation scaling like $O\left(\frac{T}{\varepsilon}\sqrt{\frac{\log(tT/\beta)}{|S_i|}}\right)$: the error bound grows by at most a factor of $\log(t)$ compared to the bound we have for answering a single query. However, this technique provides vacuous error bounds when $T = |\mathcal{X}|$ is large. Furthermore, given a fixed $S$, the error bounds provided by the Randomized-Response mechanism are inferior to the bounds provided by the mechanism that perturbs the histogram of $D|_S$ by adding $\mathrm{Lap}(1/(\varepsilon n))$ noise to the count of each type independently. However, the main benefit of the Randomized-Response algorithm is that it is a decentralized algorithm - each individual randomizes her own data and provides the data-curator with the perturbed data (as opposed to previous mechanisms, in which the data-curator gets to observe the actual data but releases it with some noise). Therefore, the Randomized-Response mechanism is applicable in the case that the agents don’t trust even the data-curator.
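The Randomized-Response scheme and its de-biasing step can be sketched as follows. The sketch assumes the response probabilities $\frac{1+\varepsilon}{T+\varepsilon}$ (keep the true value) and $\frac{1}{T+\varepsilon}$ (each other value), implemented by keeping the true value with probability $\frac{\varepsilon}{T+\varepsilon}$ and otherwise answering uniformly at random over the whole domain. Function names and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def randomize(x, domain, epsilon, rng):
    """Each individual perturbs her own entry before handing it to the curator."""
    T = len(domain)
    if rng.random() < epsilon / (T + epsilon):
        return x                      # report the truth outright
    return rng.choice(domain)         # otherwise report a uniformly random type

def estimate_fractions(published, domain, epsilon):
    """De-bias the observed frequencies: p_hat = ((T + eps) * n_x / n - 1) / eps."""
    T = len(domain)
    n = len(published)
    counts = Counter(published)
    return {x: ((T + epsilon) * counts[x] / n - 1.0) / epsilon for x in domain}

rng = random.Random(0)
domain = ["a", "b", "c"]
data = ["a"] * 36000 + ["b"] * 18000 + ["c"] * 6000   # true fractions 0.6, 0.3, 0.1
published = [randomize(x, domain, 0.5, rng) for x in data]
estimates = estimate_fractions(published, domain, 0.5)
```

The de-biasing step multiplies the sampling error by roughly $T/\varepsilon$, which is why the guarantee degrades quickly as the domain grows.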

The Randomized-Response mechanism was first used for data-mining, actually applied for voting for one of two options w.p. $\{2/3, 1/3\}$. It was first analyzed theoretically by Kasiviswanathan et al [101], and was also used for privately estimating the size of cuts in a graph by Gupta et al [77].


3.2.3  The  Multiplicative  Weights  Mechanism 

Lastly, we conclude the survey of existing mechanisms with the Multiplicative Weights mechanism of Hardt and Rothblum [80]. This mechanism builds on the previous work of Roth and Roughgarden [128], who were the first to consider the case of queries as linear operators. In the Multiplicative Weights mechanism, the database is viewed as a histogram over $\mathcal{X}$ (normalized to 1, for convenience), so $\mathcal{D} = \{v \in \mathbb{R}_{\ge 0}^{|\mathcal{X}|} : \sum_x v_x = 1\}$. Queries are linear queries in $[0, 1]^{|\mathcal{X}|}$, and so the answer to a query $q$ is $\langle q, D\rangle$. Clearly, for any $q$, the global sensitivity of $q$ is upper bounded by $1/n$.

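Algorithm 1: Multiplicative Weights Mechanism
Input: An $n$-size database $D$, privacy parameters: $\varepsilon, \delta > 0$, failure probability $\beta > 0$, number of overall queries to answer: $t$.

Set parameters: $U = O\left(\frac{\varepsilon n\sqrt{\log(|\mathcal{X}|)}}{\log(1/\delta)\log(t/\beta)}\right)$, $\sigma = O\left(\sqrt{\frac{\sqrt{\log(|\mathcal{X}|)}\log(1/\delta)}{\varepsilon n\log(t/\beta)}}\right)$, $\alpha = O(\sigma\log(t/\beta)) = O\left(\sqrt{\frac{\sqrt{\log(|\mathcal{X}|)}\log(1/\delta)\log(t/\beta)}{\varepsilon n}}\right)$.
Initialize vector $v = \frac{1}{|\mathcal{X}|}\mathbf{1}$, and $i \leftarrow 0$.
foreach user query $q$ do
    Sample $Z \sim \mathrm{Lap}(\sigma)$.
    if $|\langle q, D\rangle + Z - \langle q, v\rangle| \le \alpha/2$ then
        Answer $q$ with $\langle q, v\rangle$. // Easy case
    else
        $i \leftarrow i + 1$. Abort if $i > U \cdot \log(1/\delta)$.
        Answer $q$ with $\langle q, D\rangle + Z$. // Hard case
        if $\langle q, D\rangle + Z < \langle q, v\rangle$ then
            Let $v'$ be a normalized histogram satisfying that $\forall x$ we have $v'_x \propto v_x \exp(-\eta q_x)$, for a step size $\eta$ proportional to $\alpha$.
        else
            Let $v'$ be a normalized histogram satisfying that $\forall x$ we have $v'_x \propto v_x \exp(-\eta(1 - q_x))$.
        Update $v \leftarrow v'$.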
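The control flow of Algorithm 1 can be made concrete with a short Python sketch. It is not a calibrated private implementation: the noise scale `sigma`, the threshold `alpha`, the step size `eta`, and the update budget `max_updates` are passed in explicitly as hypothetical parameters instead of being derived from $\varepsilon$, $\delta$, $\beta$ as in the algorithm, and the toy histogram is an assumption made for the example.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dot(q, hist):
    return sum(qx * hx for qx, hx in zip(q, hist))

def mw_mechanism(true_hist, queries, sigma, alpha, eta, max_updates, rng):
    """Answer linear queries, updating a public histogram v only on 'hard' queries."""
    size = len(true_hist)
    v = [1.0 / size] * size                      # start from the uniform histogram
    updates = 0
    answers = []
    for q in queries:
        noisy = dot(q, true_hist) + laplace_noise(sigma, rng)
        estimate = dot(q, v)
        if abs(noisy - estimate) <= alpha / 2.0:
            answers.append(estimate)             # easy case: v is already accurate
            continue
        updates += 1
        if updates > max_updates:
            break                                # abort: update budget exhausted
        answers.append(noisy)                    # hard case: answer with the noisy value
        if noisy < estimate:
            # v overestimates the query: shrink coordinates where q is large
            w = [vx * math.exp(-eta * qx) for vx, qx in zip(v, q)]
        else:
            # v underestimates the query: shrink coordinates where q is small
            w = [vx * math.exp(-eta * (1.0 - qx)) for vx, qx in zip(v, q)]
        total = sum(w)
        v = [wx / total for wx in w]
    return answers, v, updates

rng = random.Random(0)
hist = [0.7, 0.1, 0.1, 0.1]                      # hypothetical normalized database
queries = [[1.0, 0.0, 0.0, 0.0]] * 50            # the same linear query asked 50 times
answers, v, updates = mw_mechanism(hist, queries, sigma=0.001, alpha=0.1,
                                   eta=0.2, max_updates=20, rng=rng)
```

In this toy run only the first few repetitions of the query trigger updates; once $v$ is within $\alpha/2$ of the noisy answer, every later repetition is answered from $v$ without touching the data-dependent branch.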


Theorem 3.8 ([80, 77]). Algorithm 1 preserves $(\varepsilon, \delta)$-differential privacy, and w.p. $\ge 1 - \beta$ answers all user queries up to an additive error of $\alpha$.

The analysis of the Multiplicative Weights mechanism is quite intricate (and it was later improved and generalized by Gupta et al [77]). Its main outline is as follows.

1. W.p. $1 - \beta$ we have that in no query $|Z| > \alpha/4$.

2. Fix two neighboring datasets, $D$ and $D'$. Given a query $q$, we partition $[-\alpha/4, \alpha/4]$ into 2 regions: “safe”, in which we update neither $D$ nor $D'$; and its complementary “non-safe”. We further partition the “non-safe” region into “must-update”, in which we update both $D$ and $D'$, and its complementary “unknown”.

3. Conditioned on $Z$ being “safe”, there is no privacy loss. This is the crux of the analysis.

4. Conditioned on $Z$ being “non-safe”, there’s a constant probability that $Z$ is a “must-update”. Conversely, w.p. $\ge 1 - \delta/2$ we have that the number of times in which $Z$ is “non-safe” is at most #Updates $\cdot\, O(\log(1/\delta))$.

5. Since $v$ is updated using the standard Multiplicative Weights algorithm [108], then after $U$ updates we have that $|\langle q, v\rangle - \langle q, D\rangle| \le \alpha/2$ for every $q$. So at that point, no more updates are made. Therefore, we answer all $t$ queries with an error of at most $\alpha$.

6. Finally, we appeal to the standard composition arguments (Theorem 3.4) and Azuma’s inequality to show that the $O(U\log(1/\delta))$ times in which $Z$ is “non-safe” and we applied the Laplace mechanism with std $\sigma$ incur an overall privacy loss of at most $\varepsilon$ w.p. $\ge 1 - \delta/2$.


It is obvious that the counting queries setting considered by Blum et al [43] is a special case of the setting of the Multiplicative Weights mechanism: given a query $q \in Q$ we can identify it with the vector $v_q \in \{0, 1\}^{|\mathcal{X}|}$ where $v_q(x) = q(x)$ for every $x \in \mathcal{X}$, and now the counting query is precisely $\langle v_q, D\rangle$. And so, Blum et al [43] give an $\varepsilon$-differentially private mechanism that can answer each query up to an additive error of $1/(\varepsilon n)^{1/3}$ (neglecting other parameters). In contrast, the Multiplicative Weights mechanism gives an $(\varepsilon, \delta)$-differentially private mechanism (a weaker guarantee) with an additive error of $1/\sqrt{\varepsilon n}$ (a tighter utility bound). The alternative analysis of the Multiplicative Weights mechanism given by Hardt et al [81] shows that as an $\varepsilon$-differentially private algorithm, it gives error bounds of $O(1/(\varepsilon n)^{1/3})$. It is an open question whether an $\varepsilon$-differentially private algorithm exists whose utility guarantee is proportional to $1/\sqrt{n}$.

Observe also that the Multiplicative Weights mechanism is often intractable - since answering each query takes $\mathrm{poly}(|\mathcal{X}|)$ time. But even in applications where $\mathcal{X}$ has poly-size, if we wish to answer a large set of queries $Q$ then the mechanism requires us to traverse all queries in $Q$ and so it runs in time proportional to $|Q|$. And yet - the analysis of the Multiplicative Weights mechanism guarantees the existence of a small (poly-size) number of queries for which you update the vector $v$, after which it is never updated again. Therefore, one potential avenue for improving its running time is to find this small set of queries, apply the mechanism solely on them, and release the resulting $v$. This is precisely what Hardt et al [81] do, by picking queries according to the exponential mechanism (since the set of queries you select for the update might leak privacy as well). Needless to say, applying the exponential mechanism over the queries in $Q$ takes $\mathrm{poly}(|Q|)$ time. It is an open question to find a scenario (say, cut-queries) where we can find this small-size set of update-queries efficiently.

3.2.4  Our  Contribution:  The  Johnson  Lindenstrauss  Mechanism 

In Chapter 5 we detail a different mechanism that allows one to answer multiple queries, with error bound growing only logarithmically with the number of queries. This mechanism is actually a well-known one: the Johnson-Lindenstrauss transform. We show that given a database represented as an $n \times d$ matrix $A$, the mechanism that picks a random matrix $R$ of size $r \times n$ in which every entry is chosen independently from a normal Gaussian and publishes the multiplication $RA$, preserves $(\varepsilon, \delta)$-differential privacy, if all singular values of $A$ are sufficiently large. Furthermore, the Johnson-Lindenstrauss transform guarantees that for any query vector $x$ we have that w.h.p $\|Ax\|^2 \approx \frac{1}{r}\|RAx\|^2$. Therefore, this mechanism allows us to answer variance queries - where the user poses a direction (unit vector) $x$ and queries about the variance of the data along direction $x$. In the special case where $A$ is the edge-matrix of a graph, this mechanism allows the user to estimate the values of cuts - the user specifies a set of vertices $S$ and gets an estimation of the value of the cut $|E(S, \bar{S})|$.
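The utility side of this mechanism, namely that $\frac{1}{r}\|RAx\|^2$ concentrates around $\|Ax\|^2$, is easy to check numerically. The Python sketch below illustrates only the projection step; it does not verify or enforce the singular-value condition needed for privacy, and the function names and the random toy matrix are assumptions for the example.

```python
import random

def jl_sketch(A, r, rng):
    """Return the r x d matrix R*A, where R is r x n with i.i.d. N(0, 1) entries."""
    n, d = len(A), len(A[0])
    sketch = []
    for _ in range(r):
        row = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sketch.append([sum(row[i] * A[i][j] for i in range(n)) for j in range(d)])
    return sketch

def sq_norm_of_product(M, x):
    """Compute ||M x||^2 for a matrix M given as a list of rows."""
    return sum(sum(mij * xj for mij, xj in zip(row, x)) ** 2 for row in M)

def estimate_variance_query(sketch, x):
    """Estimate ||A x||^2 from the published sketch as (1/r) * ||(RA) x||^2."""
    return sq_norm_of_product(sketch, x) / len(sketch)

rng = random.Random(0)
A = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(30)]  # toy 30 x 4 database
x = [0.5, 0.5, 0.5, 0.5]                                             # unit query direction
sketch = jl_sketch(A, 2000, rng)
exact = sq_norm_of_product(A, x)
approx = estimate_variance_query(sketch, x)
```

The relative error of the estimate shrinks like $\sqrt{2/r}$, so the number of rows $r$ controls the accuracy of every variance query answered from the same sketch.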

3.3  Smooth  Sensitivity:  Answering  Queries  of  Large  Global 
Sensitivity 

Recall that the basic mechanism (Theorem 3.3) gives good utility guarantees only for queries of small global sensitivity. In contrast, when the global sensitivity is large (for example, in the extreme case where there exist $D_1 \sim D_2$ such that $f(D_1) = \min_{D \in \mathcal{D}} f(D)$ and $f(D_2) = \max_{D \in \mathcal{D}} f(D)$), this mechanism may have vacuous utility guarantees. Still, there could be instances $D \in \mathcal{D}$ for which the local sensitivity is small.

Definition 3.9. The local sensitivity of a query $f$ at a dataset $D$ is $LS_f(D) = \max_{D' \sim D} |f(D) - f(D')|$.

The local sensitivity $LS_f(D)$ may be significantly lower than the global sensitivity $GS_f$. However, adding noise proportional to $LS_f(D)$ does not preserve differential privacy because the noise level itself may leak information. A clever way to circumvent this problem is to smooth out the noise level [119].

Definition 3.10 ([119]). A $\beta$-smooth upper bound on the local sensitivity of a query $f$ is a function $S_{f,\beta}$ which satisfies (i) $\forall D \in \mathcal{D}$, $S_{f,\beta}(D) \ge LS_f(D)$, and (ii) $\forall D, D' \in \mathcal{D}$ it holds that $S_{f,\beta}(D) \le \exp(\beta \cdot d(D, D'))S_{f,\beta}(D')$.
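On a tiny domain, a $\beta$-smooth upper bound can be computed by brute force, which makes the two conditions of Definition 3.10 easy to inspect. The Python sketch below uses the function $S(D) = \max_{D'} LS_f(D')\, e^{-\beta d(D, D')}$, which satisfies both conditions (condition (ii) follows from the triangle inequality for $d$). The median query, the domain $\{0, \ldots, 4\}$, the database size, and all function names are assumptions chosen for illustration; the enumeration is exponential and is not meant as a practical algorithm.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

def median(db):
    return sorted(db)[len(db) // 2]

def distance(db1, db2):
    """Number of records that must be replaced to turn db1 into db2 (equal-size multisets)."""
    shared = sum((Counter(db1) & Counter(db2)).values())
    return len(db1) - shared

def smooth_upper_bounds(f, domain, n, beta):
    """Brute force: return (local sensitivity, smooth bound) for every size-n multiset."""
    databases = list(combinations_with_replacement(sorted(domain), n))
    local = {
        db: max(abs(f(db) - f(other)) for other in databases if distance(db, other) == 1)
        for db in databases
    }
    smooth = {
        db: max(local[other] * math.exp(-beta * distance(db, other)) for other in databases)
        for db in databases
    }
    return local, smooth

local, smooth = smooth_upper_bounds(median, range(5), n=3, beta=0.5)
```

For the database $(0, 0, 0)$ the local sensitivity of the median is 0, yet the smooth bound is strictly positive because the neighbor $(0, 0, 4)$ has local sensitivity 4; this is the information leak that smoothing is designed to hide.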


Indeed, it is possible to preserve privacy while adding noise proportional to a $\beta$-smooth upper bound on the sensitivity of a query.

Theorem 3.11 ([119]). Given a query function $f : \mathcal{D} \to \mathbb{R}$, the mechanism that for a given $D$ outputs $f(D) + Z$ with $Z \sim \mathrm{Lap}\left(\frac{2S_{f,\beta}(D)}{\varepsilon}\right)$ and with $\beta = \frac{\varepsilon}{2\ln(2/\delta)}$ preserves $(\varepsilon, \delta)$-differential privacy.


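Proof. Fix any two neighboring datasets $D$ and $D'$. Let $Z_1$ be random noise sampled from $\mathrm{Lap}\left(\frac{2S_{f,\beta}(D)}{\varepsilon}\right)$ and $Z_2$ be random noise sampled from $\mathrm{Lap}\left(\frac{2S_{f,\beta}(D')}{\varepsilon}\right)$. Calculating the ratio between the two PDFs, we have

$$\frac{\mathrm{PDF}_{Z_1}(x)}{\mathrm{PDF}_{Z_2}(x)} = \frac{S_{f,\beta}(D')}{S_{f,\beta}(D)}\exp\left(\frac{\varepsilon|x|}{2}\left(\frac{1}{S_{f,\beta}(D')} - \frac{1}{S_{f,\beta}(D)}\right)\right)$$

Because of the definition of $\beta$ we have that w.p. $1 - \delta/2$ it holds that $|x| \le S_{f,\beta}(D')/\beta$, so w.p. $\ge 1 - \delta/2$ we have that the ratio of the two PDFs is bounded by

$$\frac{\mathrm{PDF}_{Z_1}(x)}{\mathrm{PDF}_{Z_2}(x)} \le e^{\varepsilon/2}$$

So now fix any $S \subseteq \mathbb{R}$ and we have

$$\begin{aligned}
\Pr[f(D) + Z_1 \in S] = \Pr[Z_1 \in S - f(D)] &\le e^{\frac{\varepsilon|f(D) - f(D')|}{2S_{f,\beta}(D)}}\Pr[Z_1 \in S - f(D')] \\
&\le e^{\varepsilon/2}\Pr[Z_1 \in S - f(D')] \\
&\le e^{\varepsilon/2}\left(e^{\varepsilon/2}\Pr[Z_2 \in S - f(D')] + \frac{\delta}{2}\right) \\
&\le e^{\varepsilon}\Pr[f(D') + Z_2 \in S] + \delta
\end{aligned}$$

□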

Clearly, to compute the output of this mechanism efficiently one must present a tractable algorithm to compute the $\beta$-smooth upper bound $S_{f,\beta}(D)$, a task which is by itself often non-trivial.

3.3.1  Our  Contribution:  Restricted  Sensitivity 

In Chapter 6 we introduce the notion of restricted sensitivity, which is complementary to the notion of smooth sensitivity. Smooth sensitivity depends on the input database - given $D$, the answer to a user’s query $f$ depends on the values $f$ takes on $D$ and the neighborhood of $D$. In contrast, restricted sensitivity depends on a user-defined set of “good” instances $\mathcal{H} \subseteq \mathcal{D}$, so that given a query $f$ its answer depends on the values $f$ takes on $\mathcal{H}$. We design an $(\varepsilon, \delta)$-differentially private mechanism that given $f$ answers it with fairly small noise if it is indeed the case that $D \in \mathcal{H}$; however, this mechanism has no guarantees if it is the case that $D \notin \mathcal{H}$.

We illustrate the mechanism in a particular setting which is of growing importance nowadays - answering queries regarding social networks. Many natural questions one might ask given a social network have large global sensitivity or even large local sensitivity. (E.g. the answer to the question “how many people are friends with a person from CMU” may change significantly if Justin Bieber relocates to Pittsburgh.) We illustrate the use of the restricted sensitivity mechanism on such queries, focusing on the set $\mathcal{H}_k$ of social networks in which each person has at most $k$ friends. Indeed, whereas the global sensitivity of profile queries is $\Omega(n)$, their restricted sensitivity over $\mathcal{H}_k$ is only $O(k)$. (Continuing with our example - focusing only on the subnetwork of computer scientists, who are naturally inclined to have a limited number of friends, using restricted sensitivity we can find out in a differentially private manner whether people from CMU are friendlier than the average person.)

Chapter 4

Background:  Center-Based  Clustering 


Problems  of  clustering  data  arise  in  a  wide  range  of  different  areas  -  clustering  proteins 
by  function,  clustering  documents  by  topic,  and  clustering  images  by  who  or  what  is  in 
them,  just  to  name  a  few.  Generally  speaking,  the  goal  of  clustering  is  to  partition  n 
given  data  objects  into  k  groups  that  share  some  commonality.  From  a  machine  learning 
perspective,  the  goal  of  clustering  is  to  label  each  datapoint  with  one  of  k  distinct  labels. 
Yet,  as  opposed  to  the  classical  machine-learning  labeling  tasks  which  are  often  done  using 
access to a few labeled examples, clustering is often done in a non-supervised setting - i.e.,
without  viewing  the  labels  of  any  of  the  datapoints.  Instead,  we  are  given  access  to  a  metric 
over  the  given  dataset  S. 

Definition 4.1. A non-negative function $d : S \times S \to \mathbb{R}_{\ge 0}$ is called a metric if it satisfies the following 3 properties.

• (Reflexivity) For any $x, y \in S$ it holds that $d(x, y) = 0$ iff $x = y$.

• (Symmetry) For any $x, y \in S$ it holds that $d(x, y) = d(y, x)$.

• (Triangle inequality) For any $x, y, z \in S$ it holds that $d(x, z) \le d(x, y) + d(y, z)$.

4.1  Center-Based  Clustering  Objectives 

Operationally, clustering is often performed by optimizing some natural objective over the given metric. The focus of this thesis is on center-based objectives, such as the popular $k$-median, $k$-means and $k$-center, in which we measure a $k$-partition by choosing a special point for each cluster, called the center, and define the cost of a clustering as a function of the distances between the data points and their respective centers.

Definition 4.2. The problem of finding a $k$-partition $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ of the given dataset and finding $k$ centers $c_1, c_2, \ldots, c_k$ such that

• we minimize $\sum_i \sum_{x \in C_i} d(x, c_i)$ is called $k$-median.

• we minimize $\sum_i \sum_{x \in C_i} d^2(x, c_i)$ is called $k$-means.

• we minimize $\max_i \max_{x \in C_i} d(x, c_i)$ is called $k$-center.

From here on out we denote the optimal clustering as $\mathcal{C}^* = \{C_1^*, C_2^*, \ldots, C_k^*\}$, its respective centers as $c_1^*, c_2^*, \ldots, c_k^*$, and its cost as OPT.

It is simple to see that given a list of $k$ centers, the $k$-partition that minimizes the abovementioned costs is to assign each point to its nearest center. This partition, in which $C_i$ is the set of points that prefer $c_i$ to any other $c_j$, is known as the Voronoi partition of the metric space, and $C_i$ is called the Voronoi region of $c_i$. Conversely, given a $k$-partition $\{C_1, C_2, \ldots, C_k\}$, it is simple to find the best center point $c_i$ for the points in $C_i$ in a finite metric space (by brute-force trying all points in $C_i$). In the special case of $k$-means in Euclidean space, it is a known fact that the best center point is the centroid of $C_i$, denoted $\mu(C_i) = \frac{1}{|C_i|}\sum_{x \in C_i} x$. In fact, given a set $X$ of points in $\mathbb{R}^d$, we treat $\mu$ as an operator where $\mu(X) = \frac{1}{|X|}\sum_{x \in X} x$.
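The two observations above (nearest-center assignment for fixed centers, centroid for a fixed cluster under the $k$-means objective) can be illustrated with a small Python sketch for points in Euclidean space. The function names and the four-point example are assumptions made for illustration, and clusters are assumed to be non-empty.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def voronoi_partition(points, centers):
    """Assign every point to its nearest center; cluster i collects the points nearest to centers[i]."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster, the best k-means center for it."""
    dim = len(cluster[0])
    return tuple(sum(p[j] for p in cluster) / len(cluster) for j in range(dim))

def kmeans_cost(clusters, centers):
    """Sum of squared distances of all points to the center of their cluster."""
    return sum(sq_dist(p, c) for cluster, c in zip(clusters, centers) for p in cluster)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers = [(0, 0), (10, 0)]
clusters = voronoi_partition(points, centers)
new_centers = [centroid(c) for c in clusters]
cost_before = kmeans_cost(clusters, centers)
cost_after = kmeans_cost(clusters, new_centers)
```

Alternating these two steps never increases the $k$-means cost, which is the basis of the familiar Lloyd-style heuristic; it carries no approximation guarantee in general.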

In this thesis, we view $k$ as part of the input and not a constant, though even the 2-means problem in Euclidean space was recently shown to be NP-hard [57]. Indeed, center-based clustering in finite metrics, like many other $k$-partition problems (e.g. Max $k$-coverage, Knapsack for $k$ items, maximizing social welfare in a $k$-items auction), can be easily solved in $n^k$ time. (And Kumar et al [105] give a PTAS for the $k$-means problem in Euclidean space.) We also comment that there is a variety of applications where $k$ is quite large. This includes problems such as clustering images by who is in them, clustering protein sequences by families of organisms, and problems such as deduplication where multiple databases are combined and entries corresponding to the same true entity are to be clustered together [52, 117].

All objectives are known to be NP-hard. For k-median in a finite metric, there is a
known $(1 + 1/e)$-hardness of approximation result [90] and substantial work on
approximation algorithms [76, 48, 18, 90, 60, 106], with the best guarantee of a
$(1 + \sqrt{3} + \epsilon)$-approximation.¹ For k-means in a Euclidean space, there is also a vast
literature of approximation algorithms [120, 24, 60, 67, 79, 98] with the best guarantee a
$(9 + \epsilon)$-approximation if polynomial dependence on k and the dimension d is desired. For
the k-center problem there’s a known greedy algorithm that yields a 2-approximation [74, 85],
as well as a $(2 - \epsilon)$ hardness of approximation result [87]. In the Euclidean plane with
$L_2$ or $L_\infty$ metric, there’s a PTAS whose running time is
$O(n \log(k) + (k/\epsilon)^{O(k^{1-1/d})})$ [5].

¹ There is also a PTAS known for low-dimensional Euclidean spaces (dimension at most log log n) [14,
79].
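
To illustrate the flavor of a greedy 2-approximation for k-center, here is a minimal Python
sketch of farthest-first traversal, one standard greedy algorithm with this guarantee. That it
coincides with the algorithm of [74, 85] is an assumption on our part, and the name
`farthest_first` is illustrative.

```python
def farthest_first(points, k, d):
    """Greedy k-center: start from an arbitrary point, then repeatedly add
    the point that is farthest from the centers chosen so far."""
    centers = [points[0]]                       # arbitrary first center
    nearest = [d(x, centers[0]) for x in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: nearest[j])
        centers.append(points[i])
        # Each point remembers its distance to the closest chosen center.
        nearest = [min(nearest[j], d(points[j], points[i]))
                   for j in range(len(points))]
    return centers, max(nearest)                # centers and k-center cost

line_metric = lambda a, b: abs(a - b)
centers, radius = farthest_first([0, 1, 2, 10, 11, 20], 3, line_metric)
```

On this toy instance the greedy cost is 2, while the optimal 3-center cost is 1 (centers at
1, 10 and 20), so the factor of 2 is attained.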

4.1.1  Other  Clustering  Techniques 

In this introduction, we do not aim to survey all non-center-based clustering techniques.
We mention just a few: min-sum clustering - in which the goal is to find a k-clustering that
minimizes $\sum_i \sum_{x, y \in C_i} d(x, y)$; max k-cut - in which the goal is to find a
k-clustering that maximizes $\sum_{i<j} \sum_{x \in C_i, y \in C_j} d(x, y)$; and normalized
cuts - in which one aims to minimize $\sum_i \frac{1}{|C_i|} \sum_{x \in C_i, y \notin C_i} d(x, y)$
or one of the many other variations of this objective (e.g., conductance).² However, in this
thesis we will make use of the Single-Linkage algorithm, and therefore it is worthwhile to
mention the agglomerative clustering algorithms, which use the bottom-up approach.

Definition 4.3. The clustering algorithm that starts with n singleton clusters and
repeatedly merges the two clusters X, Y that minimize f(X, Y) where

• $f(X, Y) = \min_{x \in X, y \in Y} d(x, y)$ is called the Single-Linkage algorithm.

• $f(X, Y) = \frac{1}{|X||Y|} \sum_{x \in X, y \in Y} d(x, y)$ is called the Average-Linkage algorithm.

• $f(X, Y) = \max_{x \in X, y \in Y} d(x, y)$ is called the Complete-Linkage algorithm.

We comment that the classic variants of these algorithms halt when there are k clusters
left. In this thesis, however, we make use of a slightly less standard variant, in which the
algorithm halts only when a single cluster is reached. This in fact produces a hierarchical
clustering or a clustering tree, and we say the algorithm is laminar with a given clustering
$\mathcal{C}$ if there is a pruning of the clustering tree that produces $\mathcal{C}$.
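
The following Python sketch spells out Definition 4.3 in the variant that runs until a single
cluster remains. It is a naive illustration (no attempt at efficiency), it assumes the points are
distinct and hashable, and the helper `prune` only produces the prunings obtained by undoing
the last few merges, which is one particular family of prunings of the tree.

```python
def agglomerative(points, d, linkage="single"):
    """Merge clusters bottom-up until one cluster remains.
    Returns the list of merges, which encodes the clustering tree."""
    def f(X, Y):
        pair_dists = [d(x, y) for x in X for y in Y]
        if linkage == "single":
            return min(pair_dists)
        if linkage == "complete":
            return max(pair_dists)
        return sum(pair_dists) / len(pair_dists)    # average linkage

    clusters = [frozenset([x]) for x in points]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters minimizing f.
        X, Y = min(((A, B) for i, A in enumerate(clusters)
                    for B in clusters[i + 1:]), key=lambda p: f(*p))
        clusters = [C for C in clusters if C not in (X, Y)] + [X | Y]
        merges.append((X, Y))
    return merges

def prune(points, merges, k):
    """The k-clustering obtained by replaying all but the last k-1 merges."""
    clusters = {frozenset([x]) for x in points}
    for X, Y in merges[:len(points) - k]:
        clusters -= {X, Y}
        clusters.add(X | Y)
    return clusters

points = [0, 1, 5, 6, 20]
merges = agglomerative(points, lambda a, b: abs(a - b), linkage="single")
two_clusters = prune(points, merges, 2)
three_clusters = prune(points, merges, 3)
```

On this example Single-Linkage is laminar with the clustering {0, 1}, {5, 6}, {20}, since that
clustering is recovered by pruning the tree at three clusters.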

The  literature  regarding  agglomerative  clustering  is  rich  and  vast  and  we  do  not  survey 
it  here.  The  interested  reader  is  referred  to  [83]. 

² For normalized cuts, we think of the distance as a measure of similarity rather than difference. There are
numerous ways of converting distances into a similarity measure, the simplest of which is to put an edge
between any two x, y for which d(x, y) is smaller than some threshold.

4.2  Implicit Stability Assumptions for Clustering


The approximation results surveyed thus far are worst-case results. Each of the works
mentioned poses some c and constructs an algorithm which is guaranteed to produce a
c-approximation for any clustering instance. However, while they guarantee to output some
$\mathcal{C}$ whose cost is at most $c \cdot \mathrm{OPT}$, they do not guarantee that the output
$\mathcal{C}$ satisfies any other proximity notion to $\mathcal{C}^*$ - they do not guarantee that
$\mathcal{C}$ and $\mathcal{C}^*$ agree on the label of most datapoints, they do not guarantee that
the centers of $\mathcal{C}$ are close to the centers of $\mathcal{C}^*$, and they don’t even
guarantee that $\mathcal{C}$ and $\mathcal{C}^*$ have the same number of clusters. In this section
we discuss several works that indeed give such stronger guarantees. The caveat is that
the works we survey in this section do not fall into the framework of worst-case analysis.
These works give stronger guarantees than mere c-approximation of the cost by focusing
solely on a specific subset of all possible clustering instances - instances that adhere to
certain “nice” properties. As always, the exact definition of the properties changes from
one work to another.


4.2.1  Clustering  objectives  as  a  proxy  for  datapoints  labeling 

The work of Balcan, Blum and Gupta [25] takes a machine-learning viewpoint of
clustering. Balcan et al consider a notion of stability to approximations motivated by settings
in which there exists some (unknown) target clustering $\mathcal{C}^{target}$ we would like to produce.
Balcan et al define a clustering instance to be $(1 + \alpha, \delta)$ approximation-stable with respect
to some objective $\Phi$ (such as k-median or k-means), if any k-partition whose cost under $\Phi$
is at most $(1 + \alpha)\mathrm{OPT}$ agrees with the target clustering on all but at most $\delta n$ data points.

Definition 4.4. Given a clustering instance w.r.t. objective $\Phi$, we call the instance
$(1 + \alpha, \delta)$-approximation stable if for any $(1 + \alpha)$-approximation $\mathcal{C}$ to objective
$\Phi$, we have $\min_{\sigma \in S_k} \sum_i |C^{target}_i \setminus C_{\sigma(i)}| \le \delta n$
(here, $\sigma$ is simply a matching of the indices in the target clustering to those in $\mathcal{C}$).
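
To make the distance between two clusterings in Definition 4.4 concrete, the following Python
sketch computes it by brute force over all matchings of cluster indices. It is exponential in k
and intended for illustration only; the function name `disagreement` is ours.

```python
from itertools import permutations

def disagreement(target, clustering):
    """Minimum, over matchings sigma of cluster indices, of the number of
    points of target[i] that are missing from clustering[sigma(i)].
    Both arguments are lists of k sets over the same ground set."""
    k = len(target)
    return min(sum(len(target[i] - clustering[s[i]]) for i in range(k))
               for s in permutations(range(k)))

target = [{1, 2, 3}, {4, 5}, {6}]
other = [{4, 5, 3}, {6}, {1, 2}]
mistakes = disagreement(target, other)   # only point 3 is placed differently
n = 6
is_close = mistakes <= 0.2 * n           # the check "at most delta * n" for delta = 0.2
```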

For instances satisfying the $(1 + \alpha, \delta)$ approximation-stability assumption, Balcan et
al have shown the following results.

Theorem 4.5 ([25]).

• Assuming that the instance satisfies $(1 + \alpha, \delta)$-approximation stability w.r.t.
either the k-median objective or the k-means objective, there exists a clustering
algorithm whose output disagrees with $\mathcal{C}^{target}$ on no more than a fraction of
$O(\delta(1 + 1/\alpha))$ of the datapoints.

• Assuming that (i) the instance satisfies $(1 + \alpha, \delta)$-approximation stability w.r.t. the
k-median objective and (ii) all clusters have size at least $n \cdot O(\delta(1 + 1/\alpha))$, there
exists a clustering algorithm whose output disagrees with $\mathcal{C}^{target}$ on no more
than a fraction of $\delta$ of the datapoints.

• Assuming that (i) the instance satisfies $(1 + \alpha, \delta)$-approximation stability w.r.t. the
k-means objective and (ii) all clusters have size at least $n \cdot O(\delta(1 + 1/\alpha))$, there
exists an algorithm that outputs a (k + 1)-partition $\{C_1, C_2, \ldots, C_k, U\}$, s.t.
$\frac{|U|}{n} = O(\delta(1 + 1/\alpha))$ and
$\min_{\sigma \in S_k} \sum_i |C_i \setminus C^{target}_{\sigma(i)}| \le \delta n$.

• The problem of finding a $(1 + \alpha)$-approximation in general can be reduced to the
problem of finding a $(1 + \alpha)$-approximation on instances satisfying
$(1 + \alpha, \delta)$-approximation stability, for any $\delta > 1/\mathrm{poly}(n)$.


We do not give here full details as to the algorithms that Balcan et al [25] devise.
However, they are all extensions of the following basic idea, best illustrated using the
k-median objective. Given the optimal clustering $\mathcal{C}^*$, denote for every datapoint x its
contribution to OPT by $\varphi(x) = \min_i d(x, c_i)$ and its distance to the 2nd closest center
as $\psi(x)$. Take the $O(\delta)n$ points whose difference $\psi(x) - \varphi(x)$ is the smallest, and
let $\mathcal{C}$ be the clustering in which each of these points is assigned to its 2nd closest center.
Since $\mathcal{C}^*$ and $\mathcal{C}$ disagree on more than $\delta n$ points, the cost of $\mathcal{C}$
is greater than $(1 + \alpha)\mathrm{OPT}$. We deduce that at most a fraction of $O(\delta)$ of the points
have $\psi(x) - \varphi(x)$ below a threshold of order $\frac{\alpha \mathrm{OPT}}{\delta n}$. On the other
hand, using Markov’s inequality we have that at most an $O(\delta/\alpha)$ fraction of the points have
$\varphi(x)$ above a threshold t of the same order. So, for a fraction of at least
$1 - O(\delta(1 + 1/\alpha))$ of the points neither property holds. For any two points x, y in this set,
we have that if they belong to the same cluster then $d(x, y) < 2t$, whereas if they belong to
two different clusters then $d(x, y) > 3t$.
The algorithms of Balcan et al [25] build on this strict-separation property, taking into
account the $O(\delta(1 + 1/\alpha))n$ outliers and the fact that we don’t know OPT in advance.

Following the work of [25], similar notions were considered in other works. Balcan et
al [25] themselves also discuss approximation-stability w.r.t. the min-sum objective, and
their results were extended by Balcan and Braverman [28]. Balcan, Roglin, and Teng [30]
studied instances that satisfy $(1 + \alpha, \delta)$-approximation stability only after a fraction of
the data is removed. The work of Schalekamp et al [130] analyzes the algorithm of [25] in
practice and shows it gives a good approximation to the k-median objective when all clusters
are large.

4.2.2  k-clustering whose cost is much smaller than the cost of (k - 1)-clustering

Ostrovsky et al [121] proposed an interesting condition under which one can achieve better
k-means approximations in time polynomial in n and k. They consider k-means instances
where the optimal k-clustering has cost noticeably smaller than the cost of any (k - 1)-clustering,
motivated by the idea that “if a near-optimal k-clustering can be achieved by a
partition into fewer than k clusters, then that smaller value of k should be used to cluster
the data” [121]. Formally,

Definition 4.6. Given a clustering instance w.r.t. some objective $\Phi$, let $\mathrm{OPT}_{k-1}$ denote the
cost of the optimal (k - 1)-clustering, and OPT denote the cost of the optimal k-clustering.
We say the instance is $\gamma$-separable if $\frac{\mathrm{OPT}}{\mathrm{OPT}_{k-1}} \le \gamma^2$.

Under the assumption that $\gamma < 1/10$, Ostrovsky et al show that one can obtain a
$(1 + O(\gamma^2))$-approximation for k-means in time polynomial in n and k, by using Lloyd
steps: using centers $\{c_1, c_2, \ldots, c_k\}$ set $C_i$ as the set of points whose closest center point
is $c_i$ and then set $c_i \leftarrow \mu(C_i)$, and repeat until convergence. In fact, Ostrovsky et al pay
careful attention to the seeding of the Lloyd iterations - the initial set of k centers. More
formally, Ostrovsky et al proved the following.

Theorem 4.7 ([121]). Given a k-means instance which is $\gamma$-separable for some $\gamma < 1/10$,
there exists an $O(nkd + k^3 d)$-time algorithm that gives a $(1 + 72\gamma^2)$-approximation of the
k-means cost w.p. $1 - O(\sqrt{\gamma})$. Given $\epsilon > 0$, there exists a $2^{O(k/\epsilon)} nd$-time algorithm that
gives a $(1 + \epsilon)$-approximation of the k-means cost with constant probability.

A rough outline of the $(1 + O(\gamma^2))$-approximation algorithm of Ostrovsky et al is as
follows.

Random initial centers. Pick the first two datapoints w.p. proportional to their distance
squared. Pick additional O(k) points iteratively: given $\{p_1, p_2, \ldots, p_i\}$ choose the
next point $p_{i+1} = x$ w.p. proportional to $\min_{j=1}^{i} \|x - p_j\|^2$.

Picking k centers. Given the O(k) points chosen, find their Voronoi regions and their
respective centroids. Starting from this set of O(k) centroids, apply some deletion
procedure until k points are left.

One “Lloyd step”. Given $\{p_1, \ldots, p_k\}$, let
$B_i = \{x : \|x - p_i\| \le \frac{1}{3} \min_{j \ne i} \|p_i - p_j\|\}$.
Return $\{\mu(B_1), \mu(B_2), \ldots, \mu(B_k)\}$.

In fact, much of the analysis of Ostrovsky et al is devoted to the initial center-picking
procedure. In particular, they show that by picking just k centers using this sampling technique
we have at least a $(1 - O(\gamma^{2/3}))^k$ probability of finding one point from each cluster.
Alternatively, sampling O(k) points with this sampling procedure, we have that our sample
contains at least one point from each cluster with constant probability.

Following the work of Ostrovsky et al, this sampling procedure was refined slightly
to what is now known as $D^2$ sampling. (Where instead of picking the first 2 points
proportional to their distance, we pick the first u.a.r. and the 2nd is picked the same way
the rest are chosen.) Arthur and Vassilvitskii [15] showed that $D^2$ sampling yields an
$O(\log(k))$-approximation to the k-means optimum. Ailon et al [8] have shown that picking
$O(k \log(k))$ centers using $D^2$ sampling yields an O(1)-approximation to the k-means
objective, and Aggarwal et al [6] showed the same for picking O(k) centers. Jaiswal
and Garg [91] showed that using $D^2$ sampling for sampling k centers gives an
O(1)-approximation of the k-means objective w.p. $\ge 1/k$ if the input is $\gamma$-separable for some
constant $\gamma$, and in general gives an O(1)-approximation w.p. $\Omega(2^{-2k})$.
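
The $D^2$ sampling rule described above is short enough to state as code. The Python sketch
below is a minimal illustration for points in Euclidean space; the name `d2_sampling` and the
`rng` parameter are our own, and the sketch assumes the input contains at least k distinct
points (otherwise all sampling weights eventually become zero).

```python
import random

def d2_sampling(points, k, rng=random):
    """D^2 sampling: the first center is uniform at random; every further
    center is drawn w.p. proportional to its squared distance from the
    nearest center picked so far."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [rng.choice(points)]
    weights = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < k:
        nxt = rng.choices(points, weights=weights, k=1)[0]
        centers.append(nxt)
        # Update each point's squared distance to its nearest chosen center.
        weights = [min(w, sq_dist(x, nxt)) for w, x in zip(weights, points)]
    return centers

pts = [(0.0, 0.0), (100.0, 0.0)]
seeds = d2_sampling(pts, 2, random.Random(0))
```

Since an already chosen point has weight zero, the second draw in this two-point example
always returns the other point; far-away points are favored in the same way in general.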

4.2.3  Clustering  perturbation  resilient  instances 

Bilu and Linial [40], focusing on the max-cut problem [71], proposed considering
instances where the optimal clustering is optimal not only under the given metric, but also
under any bounded multiplicative perturbation of the given metric. This is motivated
by the fact that in practice, distances between data points are typically just the result of
some heuristic measure (e.g., edit-distance between strings or Euclidean distance in some
feature space) rather than true “semantic distance” between objects. Thus, unless the
optimal solution on the given distances is correct by pure luck, it likely is correct on small
perturbations of the given distances as well. We formally define metric perturbation and
perturbation resilience.

Definition 4.8. Given a metric (S, d), and $\alpha > 1$, we say a function
$d' : S \times S \to \mathbb{R}_{\ge 0}$ is an $\alpha$-perturbation of d, if for any $x, y \in S$ it holds that
$d(x, y) \le d'(x, y) \le \alpha d(x, y)$. Note that d' may be any non-negative function and might
not satisfy the triangle inequality.

Definition 4.9. Given a clustering instance composed of n points residing in a metric
(S, d), we say the instance is $\alpha$-perturbation resilient for a clustering objective $\Phi$ if for
any d' which is an $\alpha$-perturbation of d, the (only) optimal clustering of (S, d') under $\Phi$ is
identical, as a partition of points into subsets, to the optimal clustering of (S, d) under $\Phi$.
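
As a small illustration of Definition 4.8, the Python sketch below checks whether a given
function is an $\alpha$-perturbation of a metric on a finite point set. The function name
`is_alpha_perturbation` and the example perturbation `stretched` are hypothetical; the example
also shows that a valid perturbation may violate the triangle inequality.

```python
def is_alpha_perturbation(points, d, d_prime, alpha):
    """Check d(x, y) <= d'(x, y) <= alpha * d(x, y) for all pairs of points."""
    return all(d(x, y) <= d_prime(x, y) <= alpha * d(x, y)
               for x in points for y in points)

d = lambda x, y: abs(x - y)
# Stretch by a factor of 2 only the distances that cross between {0, 1} and {5, 6}.
stretched = lambda x, y: 2 * d(x, y) if (x < 3) != (y < 3) else d(x, y)

points = [0, 1, 5, 6]
ok = is_alpha_perturbation(points, d, stretched, alpha=2)
too_much = is_alpha_perturbation(points, d, stretched, alpha=1.5)
# stretched(0, 5) = 10 exceeds stretched(0, 1) + stretched(1, 5) = 1 + 8.
breaks_triangle = stretched(0, 5) > stretched(0, 1) + stretched(1, 5)
```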

Bilu and Linial [40] analyze max-cut instances satisfying perturbation-resilience, and
show that for instances that are stable to perturbations of multiplicative factor roughly
$O(n^{1/2})$, one can retrieve the optimal Max-Cut in polynomial time.

Theorem 4.10 ([40]). There exists a poly-time algorithm that correctly clusters
$O(\sqrt{n})$-perturbation resilient max-cut instances.

While the original version of the algorithm is to round the SDP relaxation of
Goemans-Williamson [72] for the max-cut problem, an approach that relies on heavy machinery
from linear algebra, there is a simpler combinatorial and deterministic algorithm due to Bilu et
al [39]. This algorithm iteratively identifies two vertices that belong to the same cluster
and contracts them (while summing the weights of contracted edges). It is straightforward
to see that the contracted instance is also perturbation-resilient.

Bilu and Linial conjectured that stability up to only constant magnitude perturbations
should be enough to solve the problem in polynomial time. A recent work of Makarychev
et al [109] improves on the result of Bilu and Linial, lowering the required
perturbation-resilience factor to $O(\sqrt{\log(n)} \log\log(n))$. Multiple works (including the one
in this thesis) discuss perturbation stability for other problems, such as k-median and
min-sum [29, 127], TSP [114], and finding Nash-equilibrium [19].


Additional notions of stability in clustering. We conclude by mentioning other works
that tie together the notion of clustering and stability. Ben-David et al. [36, 37] consider
a notion of stability of a clustering algorithm, which is called stable if it outputs similar
clusters for different sets of n input points drawn from the same distribution. For k-means,
the work of Meila [113] discusses the opposite direction - classifying instances where an
approximated solution for k-means is close to the target clustering. And relating to the
effectiveness of Lloyd-type iterations, it was shown that this iterative algorithm might take
exponential time to converge even for planar (d = 2) instances [137], yet other works
have shown that adding random Gaussian noise to each point independently causes the
algorithm to converge in time $O(n^k)$ ([17]) or in time poly(n, k) ([16]).

4.2.4  Our Contribution: Weak-Deletion Stability and Perturbation-Resilience
for k-Median and k-Means

In Chapter 7 we improve on the works of Ostrovsky et al [121] and Balcan et al [25], and
in fact unify them. Ostrovsky et al call a dataset separable if the ratio of the cost of the
optimal clustering with k - 1 clusters to the cost of the optimal clustering with k clusters is at
least $1/\gamma^2 > 100$ and give a $(1 + O(\gamma^2))$-approximation for the k-means objective for such
an instance. We call a dataset stable if the above ratio is greater than $\gamma$ for some constant
$\gamma > 1$, and give a PTAS for such an instance. In other words, we decouple the strength of the
assumption from the quality of the approximation - given time $n^{O(1/\epsilon)}$ we show how to get a
$(1 + \epsilon)$-approximation for either the k-median or the k-means objective for any separable
dataset. Balcan et al give an algorithm that correctly clusters a $1 - \delta$ fraction of the points
given k-median instances in which (i) the [25] notion of stability is satisfied and (ii) all
clusters in the optimal clustering are very large. Using the same notion of stability and defining
clusters as “large” if their size is only $\Omega(\delta n)$, we give an algorithm that approximates the
k-median or the k-means cost and therefore clusters all but a fraction of $\delta$ of the points
correctly. The result is obtained by defining a new notion of stability, weak-deletion stability,
that unifies both previously considered notions of stability.

In Chapter 8 we extend the Bilu-Linial [40] notion of perturbation resilience to any
center-based clustering objective, like k-median or k-means. We show that there exists
an efficient algorithm that correctly clusters any 3-perturbation resilient instance. This
algorithm is no other than the Single-Linkage algorithm (or rather, a simple variant on the
Single-Linkage algorithm).


4.3  Clustering  under  Explicit  Distributional  Assumptions 

All of the results mentioned in Section 4.2 discuss clustering instances which have certain
nice properties due to some compelling story. In contrast, a different line of work in clustering
pin-points the nature of the clustering instance so that it comes from a specific model.
More importantly, in this line of work the datapoints are not arbitrary points of some
instance, but rather they are the outcome of multiple iid draws from some distribution.
For these types of problems, the algorithm is often aware of the nature of the distribution
(e.g. Gaussian, log-concave, heavy-tailed etc.) yet the specifics of the distribution (namely,
mean and variance) are unknown.

Specifically, we discuss a mixture model. In this type of instance we get to view n
points in a d-dimensional Euclidean space, and each point was drawn iid from some
distribution. We assume that overall there exist k different distributions, which leads to the
existence of k clusters in our instance - where all points in cluster i are drawn from the
same distribution. In fact, we can think of the points themselves as points that were
sampled independently from the i-th distribution w.p. $w_i = |C_i|/n$. We will use the standard
notation $\mu_i = \mu(C_i) \in \mathbb{R}^d$ and $\Sigma_i \in \mathbb{R}^{d \times d}$ to denote the mean and
variance of the i-th distribution, and the maximal directional variance of the i-th distribution
(the leading singular value of $\Sigma_i$) is denoted as $\sigma_i^2$. We also denote
$w_{\min} = \min_i \{w_i\}$ and $\sigma_{\max} = \max_i \{\sigma_i\}$.
Observe - the standard approach is to design algorithms that give good estimation of the
problem’s parameters, namely $\{(w_i, \mu_i, \Sigma_i)\}_{i=1}^{k}$. In this thesis however, we focus on
works that retrieve the actual clustering of the data, after which it is easy to estimate these
parameters using the empirical weights, means and variances.

Of the immensely rich literature regarding learning mixture models, we focus in this
overview on one particular avenue - learning under center-separation assumptions. We
discuss works that cluster instances that arise from mixture models under the additional
assumption of center separation: that for every pair of cluster centers $\mu_i, \mu_j$ with $i \ne j$
we have that $\|\mu_i - \mu_j\|$ is lower bounded by some function
$f(n, k, d, w_i, w_j, \sigma_i, \sigma_j)$.


4.3.1  Gaussian  Mixture  Model 

Probably the most well-studied case of this mixture model is the Gaussian mixture model,
in which all k distributions are multivariate Gaussians. Dasgupta [56] was the first to give
an algorithm with a guaranteed utility bound for the case of a Gaussian mixture model with
center separation of $\forall i \ne j$, $\|\mu_i - \mu_j\| \ge \sqrt{d}(\sigma_i + \sigma_j)$, focusing on
the case where the Gaussians are spherical (where $\forall i$, $\Sigma_i = \sigma_i^2 I_{d \times d}$) and
$w_{\min} = \Omega(1/k)$. This bound was later improved by Dasgupta and Schulman [59] for the
case of spherical Gaussians with center separation of
$\Omega(d^{1/4}(\sigma_i + \sigma_j) \log(n))$.

While the algorithms of Dasgupta [56] and Dasgupta and Schulman [59] are non-trivial
(the first involves random projections and the latter involves a variant of the EM
algorithm), there are simple observations that provide intuition as to why such clustering
tasks should be poly-time solvable. Focusing on spherical Gaussians, it is a known fact that
if $x \sim \mathcal{N}(\mu, \sigma^2 I_{d \times d})$ then $E[\|x - \mu\|] \approx \sigma\sqrt{d}$, and the variance
of the distance is roughly $\sigma d^{1/4}$. Moreover, up to constant factors, these bounds also
apply for two independent samples from the same spherical Gaussian. In fact, Lemma 4.11
below (from [59]) gives tight concentration bounds on the distances between any two datapoints.
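
The concentration of the distance around $\sigma\sqrt{d}$ is easy to observe empirically. The Python
sketch below is a simulation under our own choice of parameters (d = 400, $\sigma$ = 2, 200
samples); it is meant only as a sanity check of the stated fact, not as part of any of the cited
algorithms, and the function names are illustrative.

```python
import math
import random

def sample_spherical_gaussian(mu, sigma, rng):
    """One draw from N(mu, sigma^2 I): independent coordinates."""
    return [m + rng.gauss(0.0, sigma) for m in mu]

def mean_distance_to_center(d, sigma, trials, rng):
    """Empirical average of ||x - mu|| over independent draws of x."""
    mu = [0.0] * d
    total = 0.0
    for _ in range(trials):
        x = sample_spherical_gaussian(mu, sigma, rng)
        total += math.sqrt(sum((a - m) ** 2 for a, m in zip(x, mu)))
    return total / trials

rng = random.Random(0)
d, sigma = 400, 2.0
empirical = mean_distance_to_center(d, sigma, trials=200, rng=rng)
predicted = sigma * math.sqrt(d)      # = 40.0
```

In high dimension the empirical average lands within a small relative error of the prediction,
which is the intuition behind separating clusters by pairwise distances.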

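Lemma 4.11 ([59]). Let $x \sim \mathcal{N}(\mu_i, \sigma_i^2 I_{d \times d})$ and
$y \sim \mathcal{N}(\mu_j, \sigma_j^2 I_{d \times d})$ for any i, j (not necessarily different). Then for any
$a > 0$ we have that w.p. $\ge 1 - e^{-\Omega(d^{2a})}$ it holds that

$\|x - y\|^2 \in \|\mu_i - \mu_j\|^2 + (\sigma_i^2 + \sigma_j^2)(d \pm d^{1/2+a}) \pm 2\|\mu_i - \mu_j\| d^{a} \sqrt{\sigma_i^2 + \sigma_j^2}$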


So roughly speaking, when $\|\mu_i - \mu_j\|^2 > d^{1/2+a}(\sigma_i^2 + \sigma_j^2)$ we have that w.h.p.
each point is closer to the points of its own cluster than to any of the points of any of the
other clusters.

The next step in the line of works learning Gaussian mixture models under center
separation was the work of Vempala and Wang [138]. They observed that for spherical
Gaussians, projecting the data onto the subspace spanned by the top k singular vectors
keeps all Gaussian centers in place. In other words, the subspace spanned by the top k
singular vectors is precisely the subspace spanned by the k centers
$\{\mu_1, \mu_2, \ldots, \mu_k\}$. As a result, for spherical Gaussians the center-separation bound
they introduced no longer depends on d, and rather it is
$\|\mu_i - \mu_j\| = \Omega(k^{1/4}(\sigma_i + \sigma_j))$. Achlioptas and McSherry [4] gave an
algorithm for general Gaussians under the center-separation bound of
$\Omega((\sqrt{k} + \sqrt{1/w_i} + \sqrt{1/w_j})(\sigma_i + \sigma_j))$. Achlioptas and McSherry’s main
observation is that projecting the data onto its top k singular vectors shifts the cluster centers
by no more than $\sigma_{\max}/\sqrt{w_i}$, while leaving the distance between a datapoint and its
corresponding cluster-center at $\sqrt{k}\sigma_i$. Therefore, suppose cluster 1 has the largest
directional variance $\sigma_1 = \sigma_{\max}$; then the center separation bound along with the triangle
inequality give that w.h.p. for any two projected points x, y from the same cluster and any two
projected points x', y' from cluster $C_1$ we have $\|x - y\| < \|x' - y'\|$, and for any two
projected points $x'' \in C_1$ and $y'' \notin C_1$ it holds that $\|x' - y'\| < \|x'' - y''\|$. As a
result, by running Single-Linkage on the projected instance and stopping with two clusters, we
have a partition of the dataset that is laminar with the original clustering (we can then recurse
on each side of the partition).


Other Mixture Models. Following Dasgupta [56], mixture models of other distributions
were studied as well. Arora and Kannan [13] study arbitrary Gaussians and log-concave
distributions, and the results of Achlioptas and McSherry [4] also apply to log-concave
distributions. Kannan et al [96] and Chaudhuri and Rao [49] apply SVD projections to other
mixture models. Chaudhuri and Rao [50] also study mixture models for product distributions,
and Dasgupta et al [54] study general mixture models (under center separation that is
polynomially dependent on n). Finally, Brubaker and Vempala [47] tweak the separation
condition, just for the case of k = 2 Gaussians, requiring that $\|\mu_1 - \mu_2\|$ is greater than
the projected variance of the Gaussian along the direction of $\mu_1 - \mu_2$. Kalai, Moitra and
Valiant [95] give an algorithm for learning 2 Gaussians with no separation assumptions,
and Moitra and Valiant [116] extend this result to any constant k. Belkin and Sinha [35]
give an algorithm for various other mixture models (with no separation assumptions).

4.3.2  The  Planted  Partition  Model 

A very different distributional model was considered in the work of McSherry [110]. In
his Planted Partition Model the input is a graph over n nodes of k types (corresponding
to the k clusters). The edges of this graph were sampled independently from Bernoulli
random variables, where for every pair of nodes u, v the edge uv is placed in the graph w.p.
$p_{c(u),c(v)}$, where c(u) and c(v) are the clusters of u and v respectively. As always, our goal
is to find the k-clustering that yielded this graph (and to estimate $p_{i,j}$, which is easy given
the correct k-clustering).

It is clear that clustering would be easy if we had access to the matrix
$P \in \mathbb{R}^{n \times n}$ where $P_{u,v} = p_{c(u),c(v)}$. But it is less clear whether the Planted
Partition model fits into the framework of clustering via center proximity. In fact - what are
the cluster means in this case? Observe, if we take the data and view a node u as a vector
$x_u \in \{0, 1\}^n$, indicating the set of nodes that are u’s neighbors, then by averaging the
vectors $\{x_u : u \in C_i\}$ we have that their mean, $\mu_i$, should get very close to the row
vector of P corresponding to the nodes in $C_i$. Furthermore, since each row $x_u$ is a random
rounding of $\mu_i$ to integral values, $x_u$ should be fairly close to its expectation, and in fact
closer to $\mu_i$ than to any other $\mu_j$. This gives rise to the hope that with a suitable center
separation bound, we should be able to cluster the nodes correctly. McSherry also observed that
P is a rank k matrix, and so the separation bound in [110] has polynomial dependence on k
rather than on n.
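
The view of nodes as 0/1 vectors whose cluster means approximate rows of P can be checked
on a small synthetic instance. The Python sketch below samples a planted-partition graph and
averages the rows of one cluster; the parameters (two clusters of 100 nodes, edge probabilities
0.8 inside and 0.1 across) and the function names are our own illustrative choices, and
self-loops are never sampled.

```python
import random

def planted_partition(labels, p, rng):
    """Sample a symmetric 0/1 adjacency matrix: edge uv appears
    independently w.p. p[c(u)][c(v)], where labels[u] = c(u)."""
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p[labels[u]][labels[v]]:
                A[u][v] = A[v][u] = 1
    return A

def cluster_mean(A, labels, i):
    """Average of the row vectors x_u over the nodes u in cluster i."""
    rows = [A[u] for u in range(len(labels)) if labels[u] == i]
    return [sum(col) / len(rows) for col in zip(*rows)]

rng = random.Random(1)
labels = [0] * 100 + [1] * 100
p = [[0.8, 0.1], [0.1, 0.8]]
A = planted_partition(labels, p, rng)
mu0 = cluster_mean(A, labels, 0)
inside = sum(mu0[:100]) / 100     # close to p[0][0]
across = sum(mu0[100:]) / 100     # close to p[0][1]
```

The averaged vector is close to the corresponding row of P on both blocks of coordinates,
even though every individual row $x_u$ is a noisy 0/1 rounding of it.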

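Formally, denoting $\sigma_{\max} = \max_{r,s} \sqrt{p_{r,s}}$, McSherry proved that under center
separation of

$\|\mu_i - \mu_j\| = \Omega\left( \sigma_{\max} \sqrt{k \left( \frac{1}{w_{\min}} + \log\left(\frac{n}{\delta}\right) \right)} \right)$

it is possible to correctly cluster all nodes w.p. $1 - \delta$.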

4.3.3  Our  Contribution:  Improving  on  Kumar-Kannan  Separability 

Aiming to unify many of the previous works regarding mixture models, Kumar and
Kannan [104] defined a deterministic condition, which is independent of any specific
distribution, and showed how to correctly cluster datasets satisfying this condition. Having
established an algorithm that correctly clusters datasets satisfying this condition, they show
that indeed many of the previously studied mixture models do satisfy this separation condition
(w.h.p.). However, their approach has two drawbacks: first, its guarantees are wasteful
with respect to k, the number of clusters, and don’t match the best known bounds’
dependencies on k; secondly, their condition is somewhat complicated in comparison to the idea
of center-separation. In Chapter 9 we give further details regarding the work of Kumar
and Kannan, and discuss our improvements. In particular, we show how the basic tools of
the triangle inequality and Markov’s inequality allow us to give a simple analysis of their
algorithm and its various applications.

Part II


Differential  Privacy 


“Oh, look . . . they’re reading '1984' in Ms. Smith’s English class.”

Chapter 5

The  Johnson-Lindenstrauss  Transform 
Itself  Preserves  Differential  Privacy 

5.1  Introduction 


The celebrated Johnson-Lindenstrauss transform [93] is widely used across many areas
of Computer Science. A very non-exhaustive list of related applications includes metric
and graph embeddings [45, 107], computational speedups [129, 139], machine learning
[26, 131], information retrieval [123], nearest-neighbor search [103, 89, 7], and compressed
sensing [31]. Here we unveil a new application of the Johnson-Lindenstrauss
transform - it also preserves differential privacy. It allows us to release statistics about our
given database, while guaranteeing that for any two neighboring databases (databases that
differ on the details of any single individual), the distributions over potential outputs are
statistically close.

The simplest technique of preserving differential privacy by adding small random noise
(Theorem 3.3) lies at the core of an overwhelming majority of algorithms that preserve
differential privacy. In fact, many differentially private algorithms follow a common outline.
They take an existing algorithm and revise it by adding such random noise each time the
algorithm operates on the sensitive data. Proving that the revised algorithm preserves
differential privacy is almost immediate, because differential privacy is composable. On the
other hand, providing good bounds on the revised algorithm's utility requires bounding the
overall noise added to the algorithm, which is often difficult. Here we take the complementary
approach. We show that an existing algorithm preserves differential privacy provided we
slightly alter the input in a reversible way. Our analysis of the algorithm's utility is
immediate, whereas privacy guarantees require a non-trivial proof.

We prove that by multiplying a given database with a vector of iid normal Gaussians,
we can output the result while preserving differential privacy (assuming the database has
certain properties, see “our technique”). This technique is none other than the
Johnson-Lindenstrauss transform, and it is guaranteed to preserve w.h.p. the $L_2$-norm of the
given database up to a small multiplicative factor. Therefore, whenever answers to users'
queries can be formalized as the length of the product between the given database and a
query-vector, utility bounds are straightforward.

For example, consider the case where our input is composed of $n$ points in $\mathbb{R}^d$
given as an $n \times d$ matrix $A$. We define two matrices as neighbors if they differ on a
single row and the norm of the difference is at most 1.¹ Under this notion of neighbors, a
simple privacy preserving mechanism allows us to output the mean of the rows in $A$, but what
about the covariance matrix $A^T A$? We prove that the JL transform gives an
$(\epsilon, \delta)$-differentially private algorithm that outputs a sanitized covariance
matrix. Furthermore, for directional variance queries, where users give a unit-length vector
$x$ and wish to know the variance of $A$ along $x$ (see definition in Section 5.2), we give
utility bounds that are independent of $d$ and $n$. In contrast, all other differentially
private algorithms that answer directional variance queries have utility guarantees that
depend on $d$ or $n$. Observe that our utility guarantees are somewhat weaker than usual.
Recall that the JL lemma guarantees that w.h.p. lengths are preserved up to a small
multiplicative error, so for each query our algorithm's estimation has w.h.p. a small
multiplicative error in addition to the additive error.

¹ This notion of neighboring inputs, also considered in [111, 82], is somewhat different from
the typical notion of privacy, which allows any individual to change her attributes arbitrarily.

A special case of directional variance queries is cut-queries of a graph. Suppose our
database is a graph $G$ and define two graphs as neighbors if they differ on a single edge.
Indeed, this edge-adjacency notion between graphs is weaker than the notion of
vertex-adjacency, where two neighboring graphs may differ on any number of edges incident
to the same vertex. However, this notion of adjacency corresponds to the same notion of
adjacency between matrices considered above - in Chapter 2.1 we defined the edge matrix
$E_G$ of a graph $G$, and indeed for any two $G$ and $G'$ that differ on a single edge, $E_G$
and $E_{G'}$ differ on a single row (with the norm of the difference being $O(1)$). For
cut-queries, users pose a nonempty strict subset of vertices $S$, and wish to know how many
edges in $G$ cross the $(S, \bar{S})$-cut. Such a query can be formalized as the $L_2$-norm
squared of the product $E_G 1_S$, where $1_S$ is the indicator vector of $S$ (see Chapter 2.1).
Here, we prove that the JL transform allows us to publish a perturbed Laplacian of $G$ while
preserving $(\epsilon, \delta)$-differential privacy. Comparing our JL-based algorithm to
existing algorithms, we show that we add (w.h.p.) $O(|S|)$ random noise to the true answer
(alternatively: w.h.p. we add only constant noise to the query once it is normalized by $|S|$).
In contrast, all other algorithms add noise proportional to the number of vertices (or edges)
in the graph.


Our technique. It is best to demonstrate our technique on a toy example. Assume $D$
is a database represented as a $\{0, 1\}^n$-vector, and suppose we sample a vector $Y$ of $n$
iid normal Gaussians and publish $X = Y^T D$. Our output is therefore distributed like
a Gaussian random variable of 0 mean and variance $\sigma^2 = \|D\|^2$. Assume a single entry
in $D$ changes from 0 to 1 and denote the new database as $D'$. Then $X' = Y^T D'$
is distributed like a Gaussian of 0 mean and variance $\lambda^2 = \|D\|^2 + 1$. Comparing
$\mathrm{PDF}_X(x) = (2\pi\sigma^2)^{-1/2} \exp(-x^2/(2\sigma^2))$ to
$\mathrm{PDF}_{X'}(x) = (2\pi\lambda^2)^{-1/2} \exp(-x^2/(2\lambda^2))$ we have that $\forall x$,
$\sqrt{\lambda^2/\sigma^2}\,\mathrm{PDF}_{X'}(x) \ge \mathrm{PDF}_X(x) \ge \exp\left(-\frac{x^2}{2} \cdot \frac{1}{\sigma^2\lambda^2}\right) \mathrm{PDF}_{X'}(x)$.
Using concentration bounds on Gaussians we deduce that if
$\lambda^2 > \sigma^2 = \Omega(\log(1/\delta)/\epsilon)$, then w.p. $\ge 1 - \delta$ both PDFs
are within a multiplicative factor of $e^{\pm\epsilon}$. We now repeat this process $r$ times
(setting $\epsilon, \delta$ accordingly) s.t. the JL lemma assures that (after scaling) w.h.p.
we output a vector of squared norm $(1 \pm \eta)\|D\|^2$ for a given $\eta$. We get utility
guarantees for publishing the number of ones in $D$ while preserving
$(\epsilon, \delta)$-differential privacy.
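The univariate comparison above can be checked numerically. The following is a minimal sketch rather than part of the original analysis: the function names and the concrete choice $\sigma^2 = \lceil \ln(2/\delta)/\epsilon \rceil$ are our own assumptions, picked as one instance of $\sigma^2 = \Omega(\log(1/\delta)/\epsilon)$. It evaluates the log-ratio of the two densities on the interval that carries all but a $\delta$ fraction of the mass of $X$.

```python
import math

def log_pdf_ratio(x, sigma2):
    """log(PDF_X(x) / PDF_X'(x)) for X ~ N(0, sigma2) and X' ~ N(0, sigma2 + 1)."""
    lam2 = sigma2 + 1.0
    # The ratio equals sqrt(lam2/sigma2) * exp(-(x^2/2) * (1/sigma2 - 1/lam2)).
    return 0.5 * math.log(lam2 / sigma2) - 0.5 * x * x * (1.0 / sigma2 - 1.0 / lam2)

def ratio_within_eps(eps, delta, grid=1001):
    """Check |log ratio| <= eps on the region |x| <= sigma * sqrt(2 ln(2/delta)).

    A Gaussian tail bound gives Pr[|X| > sigma * sqrt(2 ln(2/delta))] <= delta,
    so this region is hit with probability at least 1 - delta.
    sigma2 = ceil(ln(2/delta)/eps) is an assumed concrete setting.
    """
    sigma2 = math.ceil(math.log(2.0 / delta) / eps)
    x_max = math.sqrt(2.0 * sigma2 * math.log(2.0 / delta))
    xs = [-x_max + 2.0 * x_max * i / (grid - 1) for i in range(grid)]
    ratios = [log_pdf_ratio(x, sigma2) for x in xs]
    return max(ratios) <= eps and min(ratios) >= -eps
```

The upper bound is attained at $x = 0$ and the lower bound at the ends of the interval, which is why the check only needs the extreme values of the log-ratio.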

Keeping with our toy example, one step remains - to convert the above analysis so
that it will hold for any database, and not only databases with $w = \log(1/\delta)/\epsilon$
many ones. One way is to append $w$ one-entries to the data, but observe: this ends up
outputting $X + N$ where $N$ is random Gaussian noise! In other words, appending ones to the
data makes the above technique worse (noisier) than the classical technique of adding
random Gaussian noise. Instead, what we do is to “translate the database”. We apply a
simple deterministic affine transformation s.t. $D$ turns into a $\{\sqrt{w/n}, 1\}^n$-vector.
Applying the JL algorithm to the translated database, we output a vector whose norm squared is
$\approx (1 \pm \eta)(\|D\|^2 + w)$. Clearly, users can subtract $w$ from the result, and we
end up with $\eta w$ additive random noise (in addition to the multiplicative noise).²

² Observe that in this toy example, our $O(\log(1/\delta)/\epsilon)$ noise bound is still worse
than the noise bound of $O(\sqrt{\log(1/\delta)}/\epsilon)$ one gets from adding Gaussian noise.
However, in the applications detailed in Sections 5.3 and 5.4, the idea of changing the input
will be the key ingredient in getting noise bounds that are independent of $n$ and $d$.

It is tempting to think the above analysis suffices to show that privacy is also preserved
in the multidimensional case. Consider the case of two adjacent graphs, $G$ and $G'$, with
an edge $(a, b)$ present in $G'$ and absent in $G$. The corresponding edge matrices $E_G$ and
$E_{G'}$ differ only on a single row (see Chapter 2.1) - which is the all-0 row in $E_G$ and
a row with only two non-zero coordinates in $E_{G'}$. So when we compare the result of
multiplying $E_G^T$ with a vector of iid normal Gaussians to the result of multiplying
$E_{G'}^T$ with such a vector, only two coordinates in the outcome behave differently.
Presumably, applying the abovementioned univariate analysis to each of the two coordinates
that change suffices to prove we preserve differential privacy. Yet this intuition is false.
Multiplying $E_G^T$ with a random vector does not result in $n$ independent Gaussians, but
rather in one multivariate Gaussian. This is best illustrated with an example. Suppose $G$ is
a graph and $S$ is a subset of nodes s.t. no edge crosses the $(S, \bar{S})$-cut. Therefore
$\|E_G 1_S\|^2 = |E(S, \bar{S})| = 0$, so $E_G 1_S$ is the zero-vector, and no matter what
random projection $R$ we pick, $R^T E_G 1_S = 0$. In contrast, by adding a single edge that
crosses the $(S, \bar{S})$-cut, we get a graph $G'$ s.t. $\Pr[R^T E_{G'} 1_S \neq 0] = 1$.
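The counterexample can be reproduced on a tiny graph. This is an illustrative sketch only: the helper names are hypothetical, and the layout of the edge matrix (one row per unordered pair, equal to $e_u - e_v$ for an unweighted edge and zero otherwise) is our assumption about the definition given in Chapter 2.1.

```python
import itertools
import numpy as np

def edge_matrix(n, edges):
    """Assumed edge matrix: rows indexed by pairs u < v; row is e_u - e_v for an edge, else 0."""
    pairs = list(itertools.combinations(range(n), 2))
    E = np.zeros((len(pairs), n))
    for row, (u, v) in enumerate(pairs):
        if (u, v) in edges:
            E[row, u], E[row, v] = 1.0, -1.0
    return E

def projected_cut_vector(n, edges, S, R):
    """Compute R^T E_G 1_S for a single Gaussian vector R (one row of a projection)."""
    ind = np.zeros(n)
    ind[list(S)] = 1.0
    return float(R @ (edge_matrix(n, edges) @ ind))

rng = np.random.default_rng(0)
n, S = 6, [0, 1, 2]
edges_G = {(0, 1), (1, 2), (3, 4), (4, 5)}   # no edge crosses the (S, S-bar) cut
edges_G_prime = edges_G | {(2, 3)}           # one added edge crosses the cut
R = rng.standard_normal(n * (n - 1) // 2)

out_G = projected_cut_vector(n, edges_G, S, R)              # exactly zero for every R
out_G_prime = projected_cut_vector(n, edges_G_prime, S, R)  # nonzero with probability 1
```

An observer who evaluates the published projection on $1_S$ therefore distinguishes $G$ from $G'$ with certainty, which is why the graph has to be altered before the projection is applied.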


Organization. Next we detail related work. Section 5.2 details important notations and
formal definitions. In Sections 5.3 and 5.4 we convert the above univariate intuition to the
multivariate Gaussian case. Section 5.3 describes our results for graphs and cut-queries,
and in Section 5.3.2 we compare our method to other algorithms. Section 5.4 details the
result for directional queries (the general case), followed by a comparison with other
algorithms. Even though there are clear similarities between the analyses in Sections 5.3 and
5.4, we provide both because the graph case is simpler and analogous to the univariate
Gaussian case: if $G$ and $G'$ are two graphs without and with a certain edge respectively,
then $G$ induces the multivariate Gaussian with the “smaller” variance, and $G'$ induces the
multivariate Gaussian with the “larger” variance. In contrast, in the general case there is no
notion of “smaller” and “larger” variances. Also, the noise bound in the general case is
larger than the one for the graph case, and the theorems our analysis relies on are more
esoteric. Section 5.5 concludes with a discussion and open problems.


5.1.1  Related  Work 

The task of preserving differential privacy when the given database is a graph or a social
network was studied by Hay et al. [84], who presented a privacy preserving algorithm for
publishing the degree distribution in a graph. They also introduced and compared
multiple notions of neighboring graphs, one of which is the change of a single edge.
Nissim et al. [119] (see full version) studied the case of estimating the number of triangles
in a graph, and Karwa et al. [100] extended this result to other graph structures. Gupta et
al. [77] studied the case of answering $(S, T)$-cut queries, for two disjoint subsets of nodes
$S$ and $T$. All of the latter works use the same notion of neighboring graphs as we do. In
differential privacy it is common to think of a database as a matrix, but one seldom gives
utility guarantees for queries regarding global properties of the input matrix. Blum et al. [44]
approximate the input matrix with the PCA construction by adding $O(d^2)$ noise to the
input. The work of McSherry and Mironov [111] (inspired by the Netflix prize competition)
defines neighboring databases as a change in a single entry, and introduces $O(k^2)$
noise while outputting a rank-$k$ approximation of the input. Kapralov and Talwar [99] also
give an algorithm for releasing the PCA of a given matrix while preserving
$\epsilon$-differential privacy (the case $\delta = 0$).

The body of work on the JL transform is by now so extensive that only a book may
survey it properly [139]. Its proof has been revised numerous times and it is known
to work under various projections - using Gaussians [58], using random unit-length
vectors [69], using random $\{-1, 0, 1\}$ entries [3], or sparse matrices [55]. Our analysis
derives its privacy guarantees from applying the variant of the JL transform in which all
entries are picked independently from a normal Gaussian.

In the context of differential privacy, the JL lemma has been used to reduce the
dimensionality of an input prior to adding noise or other forms of privacy preservation. Blum
et al. [43] gave an algorithm that outputs a sanitized dataset for learning large-margin
classifiers by appealing to JL related results of [26]. Hardt and Roth [82] gave a privacy
preserving version of an algorithm of [78] that uses randomized projections onto the
image space of a given matrix. The way the JL lemma was applied in these works is very
different from the way we use it.


5.2  Basic  Definitions,  Preliminaries  and  Notations 

Privacy and utility. In this work, we deal with two types of inputs: $[0, 1]$-weighted
graphs over $n$ nodes and $n \times d$ real matrices. (We treat $w_{a,b} = 0$ as no edge
between $a$ and $b$.) Trivially extending the definition in [119, 100], two weighted $n$-node
graphs $G$ and $G'$ are called neighbors if they differ on the weight of a single edge
$(a, b)$. Like in [82], two $n \times d$ matrices $A$ and $A'$ are called neighbors if all the
coordinates on which $A$ and $A'$ differ lie on a single row $i$, s.t.
$\|A_{(i)} - A'_{(i)}\|_2 \le 1$, where $A_{(i)}$ denotes the $i$-th row of $A$.

For each type of input we are interested in answering a different type of query. For
graphs, we are interested in cut-queries: given a nonempty strict subset $S$ of the vertices
of the graph, we wish to know the total weight of edges crossing the $(S, \bar{S})$-cut.
We denote this as $\Phi_G(S) = \sum_{u \in S, v \in \bar{S}} w_{u,v}$.

Definition 5.1. We say an algorithm ALG gives a $(\eta, \tau, \nu)$-approximation for cut
queries, if for every nonempty $S$ it holds that

$$\Pr\left[(1 - \eta)\Phi_G(S) - \tau \le ALG(S) \le (1 + \eta)\Phi_G(S) + \tau\right] \ge 1 - \nu$$

For $n \times d$ matrices, we are interested in directional variance queries: given a
unit-length direction $x$, we wish to know the variance of $A$ along the $x$ direction:
$\Phi_A(x) = x^T A^T A x$. (Our algorithm normalizes $A$ s.t. the mean of its $n$ rows is 0.)

Definition 5.2. We say an algorithm ALG gives a $(\eta, \tau, \nu)$-approximation for
directional variance queries, if for every unit-length vector $x$ it holds that

$$\Pr\left[(1 - \eta)\Phi_A(x) - \tau \le ALG(x) \le (1 + \eta)\Phi_A(x) + \tau\right] \ge 1 - \nu$$


Finally,  we  conclude  these  Gaussian  preliminaries  with  the  famous  Johnson-Lindenstrauss 
Lemma,  our  main  tool  in  this  paper. 


Theorem 5.3 (The Johnson-Lindenstrauss transform [93]). Fix any $0 < \eta < 1/2$. Let $M$
be an $r \times m$ matrix whose entries are iid samples from $\mathcal{N}(0, 1)$. Then
$\forall x \in \mathbb{R}^m$:

$$\Pr_M\left[(1 - \eta)\|x\|^2 \le \frac{1}{r}\|Mx\|^2 \le (1 + \eta)\|x\|^2\right] \ge 1 - 2\exp(-\eta^2 r/8)$$
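To make the statement concrete, here is a minimal NumPy sketch of the Gaussian variant of the transform. The function names `jl_rows` and `jl_squared_norm` are our own, and choosing $r = \lceil 8\ln(2/\nu)/\eta^2 \rceil$ is simply the value that makes the failure probability in Theorem 5.3 at most $\nu$.

```python
import numpy as np

def jl_rows(eta, nu):
    """Smallest integer r with 2 * exp(-eta^2 * r / 8) <= nu."""
    return int(np.ceil(8.0 * np.log(2.0 / nu) / eta**2))

def jl_squared_norm(x, r, rng):
    """Return (1/r) * ||M x||^2 for an r x m matrix M with iid N(0, 1) entries."""
    M = rng.standard_normal((r, x.shape[0]))
    return float(np.sum((M @ x) ** 2) / r)

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
eta, nu = 0.25, 1e-3
r = jl_rows(eta, nu)
estimate = jl_squared_norm(x, r, rng)
true_value = float(x @ x)
```

Note that $r$ depends only on $\eta$ and $\nu$, not on the dimension $m$ of $x$, which is the property the utility bounds in this chapter inherit.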


Additional notations. We denote by $e_a$ the indicator vector of $a$. We denote
$e_{a,b} = e_a - e_b$. It follows that the $n \times n$ matrix $L_{a,b} = e_{a,b} e_{a,b}^T$
is the matrix whose projection over coordinates $a, b$ is
$\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$, while every other entry is 0. We also denote
$E_{a,b}$ as the $\binom{n}{2} \times n$ matrix whose rows are all zeros except for the row
indexed by the $(a, b)$ pair, which is $e_{a,b}^T$. Observe:
$L_{a,b} = e_{a,b} e_{a,b}^T = E_{a,b}^T E_{a,b}$.
5.3 Publishing a Perturbed Laplacian

5.3.1  The  Johnson-Lindenstrauss  Algorithm 

We  now  show  that  the  Johnson  Lindenstrauss  transform  preserves  differential  privacy.  We 
first  detail  our  algorithm,  then  analyze  it. 

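Algorithm 2: Outputting the Laplacian of a Graph while Preserving Differential Privacy

Input: An $n$-node graph $G$, parameters: $\epsilon, \delta, \eta, \nu > 0$
Output: A Laplacian of a graph $L$

1 Set $r = \frac{8\ln(2/\nu)}{\eta^2}$, and $w = \frac{\sqrt{32 r \ln(2/\delta)}}{\epsilon} \ln(4r/\delta)$.

2 For every $u \neq v$, set $w_{u,v} \leftarrow \frac{w}{n} + \left(1 - \frac{w}{n}\right) w_{u,v}$.

3 Pick a matrix $M$ of size $r \times \binom{n}{2}$, whose entries are iid samples of $\mathcal{N}(0, 1)$.

4 return $L = \frac{1}{r} E_G^T M^T M E_G$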


Algorithm 3: Approximating $\Phi_G(S)$

Input: A nonempty $S \subset V(G)$ of size $s$, parameters $n$, $w$ and Laplacian $L$ from
Algorithm 2.

return $R(S) = \frac{1}{1 - \frac{w}{n}}\left(1_S^T L 1_S - w\,\frac{s(n-s)}{n}\right)$
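The two algorithms can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: we read the parameter settings as $r = 8\ln(2/\nu)/\eta^2$ and $w = \sqrt{32 r \ln(2/\delta)}\,\ln(4r/\delta)/\epsilon$, we assume the edge matrix has one row $\sqrt{w_{u,v}}\,(e_u - e_v)^T$ per pair $u < v$, and all function names are hypothetical. Because $w$ is large for realistic privacy parameters, the usage example passes small values of $r$ and $w$ directly, so it demonstrates only the mechanics and carries no privacy guarantee.

```python
import itertools
import numpy as np

def jl_parameters(eps, delta, eta, nu):
    """Assumed reading of step 1 of Algorithm 2."""
    r = int(np.ceil(8.0 * np.log(2.0 / nu) / eta**2))
    w = np.sqrt(32.0 * r * np.log(2.0 / delta)) / eps * np.log(4.0 * r / delta)
    return r, w

def edge_matrix(weights):
    """Assumed edge matrix: row (u, v), u < v, equals sqrt(w_uv) * (e_u - e_v)."""
    n = weights.shape[0]
    pairs = list(itertools.combinations(range(n), 2))
    E = np.zeros((len(pairs), n))
    for row, (u, v) in enumerate(pairs):
        root = np.sqrt(weights[u, v])
        E[row, u], E[row, v] = root, -root
    return E

def perturbed_laplacian(weights, r, w, rng):
    """Steps 2-4 of Algorithm 2: shift the weights, project, return (1/r) E^T M^T M E."""
    n = weights.shape[0]
    shifted = w / n + (1.0 - w / n) * weights   # only off-diagonal entries are used
    E = edge_matrix(shifted)
    M = rng.standard_normal((r, E.shape[0]))
    O = M @ E
    return O.T @ O / r

def estimate_cut(L, S, n, w):
    """Algorithm 3: undo the weight shift on the quadratic form 1_S^T L 1_S."""
    ind = np.zeros(n)
    ind[list(S)] = 1.0
    s = len(S)
    return float((ind @ L @ ind - w * s * (n - s) / n) / (1.0 - w / n))
```

A user holding only `L`, `n` and `w` can call `estimate_cut` for any subset `S`, which mirrors the non-interactive nature of the mechanism.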


Theorem 5.4. Algorithm 2 preserves $(\epsilon, \delta)$-differential privacy w.r.t. edge changes in $G$.

Theorem 5.5. For every $\eta, \nu > 0$, given a nonempty $S$ of size $s < n$, Algorithm 3 gives a
$(\eta, \tau, \nu)$-approximation for the cut $\Phi_G(S)$, for
$\tau = O\left(s \cdot \frac{\sqrt{\ln(1/\delta)\ln(1/\nu)}}{\epsilon} \ln\left(\ln(1/\nu)/(\eta^2\delta)\right)\right)$.

Corollary 5.6. For every $\epsilon, \delta, \eta, \nu > 0$ and any predefined set of $k$
cut-queries, there exists an $(\epsilon, \delta)$-differentially private mechanism that w.p.
$\ge 1 - \nu$ approximates each cut query up to a multiplicative factor of $1 \pm \eta$ and an
additive error of
$\tau = O\left(s \cdot \frac{\sqrt{\ln(1/\delta)\ln(k/\nu)}}{\epsilon} \ln\left(\ln(k/\nu)/(\eta^2\delta)\right)\right)$.


Clearly, once Algorithm 2 publishes $L$, any user interested in estimating $\Phi_G(S)$ for
some nonempty $S \subset V(G)$ can run Algorithm 3 on her own. Also, observe that $w$ is
independent of $n$, which we think of as a large number, so we assume throughout the proofs
of both theorems that $\frac{w}{n} < \frac{1}{2}$. Now, the proof of Theorem 5.5 is immediate
from the JL Lemma.
Proof of Theorem 5.5. Let us denote $G$ as the input graph for Algorithm 2, and $H$ as the
graph resulting from the changes in edge-weights Algorithm 2 makes. Therefore,

$$L_H = \frac{w}{n} L_{K_n} + \left(1 - \frac{w}{n}\right) L_G$$

Fix $S$. The JL Lemma (Theorem 5.3) assures us that w.p. $\ge 1 - \nu$ we have

$$(1 - \eta)\, 1_S^T L_H 1_S \le 1_S^T L 1_S \le (1 + \eta)\, 1_S^T L_H 1_S$$

The proof now follows from basic arithmetic and the value of $w$.

$$R(S) \le \frac{1}{1 - \frac{w}{n}}\left((1 + \eta)\, 1_S^T L_H 1_S - w\,\frac{s(n-s)}{n}\right)$$

$$= \frac{1}{1 - \frac{w}{n}}\left((1 + \eta)\frac{w}{n} s(n-s) + (1 + \eta)\left(1 - \frac{w}{n}\right) 1_S^T L_G 1_S - w\,\frac{s(n-s)}{n}\right)$$

$$\le (1 + \eta)\Phi_G(S) + 2\eta w \cdot \frac{s(n-s)}{n} = (1 + \eta)\Phi_G(S) + \tau$$

where $\tau \le 2\eta w \cdot s$. The lower bound is obtained exactly the same way. □

Comment. The guarantee of Theorem 5.5 is not to be mistaken with a weaker guarantee
of providing a good approximation to most cut-queries. Theorem 5.5 guarantees that any
set of $k$ predetermined cuts is well-approximated by Algorithm 3, assuming Algorithm 2
sets $\nu < 1/(2k)$. In contrast, giving a good approximation to most cuts can be done by a
very simple (and privacy preserving) algorithm: by outputting the number of edges in the
graph (with small Laplacian noise). After all, we expect a cut to have
$\frac{|E(G)|}{\binom{n}{2}}\, s(n-s)$ edges crossing it. In other words, the high probability
of success is over internal randomness in the algorithm and not randomness in the choice of
cuts.

We turn our attention to the proof of Theorem 5.4. We fix any two graphs $G$ and
$G'$, which differ only on a single edge, $(a, b)$. We think of $(a, b)$ as an edge in $G'$
which isn't present in $G$, and in the proof of Theorem 5.4, we identify $G$ with the result
of the manipulation Algorithm 2 performs over $G$, and assume that the edge $(a, b)$ is present
in both graphs, only it has weight $\frac{w}{n}$ in $G$ and weight 1 in $G'$. Clearly, this
analysis carries over to a smaller change, when the edge $(a, b)$ is present in both graphs
but with different weights. (Recall, we assume all edge weights are bounded by 1.)

Now, the proof follows from assuming that Algorithm 2 outputs the matrix $O = M E_G$,
instead of $L = \frac{1}{r} O^T O$. (Clearly, outputting $O$ allows one to reconstruct $L$.)
Observe that $O$ is composed of $r$ identically distributed rows: each row is created by
sampling a $\binom{n}{2}$-dimensional vector $Y$ whose entries $\sim \mathcal{N}(0, 1)$, then
outputting $Y^T E_G$. Therefore, we prove Theorem 5.4 by showing that each row maintains
$(\epsilon_0, \delta_0)$-differential privacy, for the right parameters $\epsilon_0, \delta_0$.
To match standard notation, we transpose row vectors to column vectors, and compare the
distributions $E_G^T Y$ and $E_{G'}^T Y$.

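Claim 5.7. Set $\epsilon_0 = \frac{\epsilon}{\sqrt{4r\ln(2/\delta)}}$, $\delta_0 = \frac{\delta}{2r}$. Then,

$$\forall x, \quad \mathrm{PDF}_{E_G^T Y}(x) \le e^{\epsilon_0}\, \mathrm{PDF}_{E_{G'}^T Y}(x) \qquad (5.1)$$

Denote $S = \{x : \mathrm{PDF}_{E_G^T Y}(x) \ge e^{-\epsilon_0}\, \mathrm{PDF}_{E_{G'}^T Y}(x)\}$. Then

$$\Pr[S] \ge 1 - \delta_0 \qquad (5.2)$$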

Proof of Theorem 5.4 based on Claim 5.7. Apply the composition theorem of [65] for $r$
iid samples each preserving $(\epsilon_0, \delta_0)$-differential privacy. □

To prove Claim 5.7, we denote $X = E_G^T Y$ and $X' = E_{G'}^T Y$. From the preliminaries it
follows that $X$ is a multivariate Gaussian distributed according to
$\mathcal{N}(0, E_G^T I_{\binom{n}{2} \times \binom{n}{2}} E_G) = \mathcal{N}(0, L_G)$, and
similarly, $X' \sim \mathcal{N}(0, L_{G'})$. In order to analyze the two distributions,
$\mathcal{N}(0, L_G)$ and $\mathcal{N}(0, L_{G'})$, we now discuss several of the properties of
$L_G$ and $L_{G'}$, then turn to the proof of Claim 5.7.

First, it is clear from the definition that the all ones vector, $1$, belongs to the kernel
space of $E_G$ and $E_{G'}$, and therefore to the kernel space of $L_G$ and $L_{G'}$. Next, we
establish a simple fact.

Fact 5.8. If $G$ is a graph s.t. for every $u \neq v$ we have that $w_{u,v} > 0$, then $1$ is
the only vector (up to scaling) in the kernel space of $E_G$ and $L_G$.

Proof. Any non-zero $x \perp 1$ has at least one positive coordinate and one negative
coordinate, thus the non-negative sum
$\|E_G x\|^2 = x^T L_G x = \sum_{u < v} w_{u,v}(x_u - x_v)^2$ is strictly positive. □

Therefore, the kernel space of both $L_G$ and of $L_{G'}$ is exactly the 1-dimensional span
of the $1$ vector (for every possible outcome $y$ of $Y$ we have that
$E_G^T y \cdot 1 = E_{G'}^T y \cdot 1 = 0$). Alternatively, both $X$ and $X'$ have support
which is exactly $V = 1^\perp$. Hence, we only need to prove the inequalities of Claim 5.7 for
$x \in V$. Secondly, observe that $L_{G'} = L_G + (1 - \frac{w}{n}) L_{a,b}$. Therefore, it
holds that for every $x \in \mathbb{R}^n$ we have
$x^T L_{G'} x = x^T L_G x + (1 - \frac{w}{n})(x_a - x_b)^2 \ge x^T L_G x$. In other words,
$L_G \preceq L_{G'}$, a fact that yields several important corollaries.

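We now introduce notation for the Singular Value Decomposition of both $L_G$ and
$L_{G'}$. We denote $E_G^T = U \Sigma V^T$ and $E_{G'}^T = U' \Lambda V'^T$, resulting in
$L_G = U \Sigma^2 U^T$, $L_{G'} = U' \Lambda^2 U'^T$, $L_G^\dagger = U \Sigma^{-2} U^T$ and
$L_{G'}^\dagger = U' \Lambda^{-2} U'^T$. We denote the singular values of $L_G$ as
$\sigma_1^2 \ge \ldots \ge \sigma_{n-1}^2 > \sigma_n^2 = 0$, and the singular values of
$L_{G'}$ as $\lambda_1^2 \ge \ldots \ge \lambda_{n-1}^2 > \lambda_n^2 = 0$.

Weyl's inequality 2.2 allows us to deduce the following fact.

Fact 5.9. Since $L_G \preceq L_{G'}$ then for every $i$ we have that $\lambda_i^2 \ge \sigma_i^2$.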

In addition, since Algorithm 2 alters the input graphs s.t. the complete graph
$\frac{w}{n} K_n$ is contained in $G$, then it also holds that
$\frac{w}{n} L_{K_n} \preceq L_G$, and so Fact 5.9 gives that for every $1 \le i \le n - 1$
we have that $\sigma_i^2 \ge w = \frac{w}{n} \cdot n$. (It is simple to see that the
eigenvalues of $L_{K_n}$ are $\{n, n, \ldots, n, 0\}$.) Furthermore, as
$L_{G'} = L_G + (1 - \frac{w}{n}) L_{a,b}$ and the singular values of $L_{a,b}$ are
$\{2, 0, 0, \ldots, 0\}$, then we have that

$$\sum_i \lambda_i^2 = \mathrm{tr}(L_{G'}) \le \mathrm{tr}(L_G) + \mathrm{tr}\left(\left(1 - \frac{w}{n}\right) L_{a,b}\right) \le \sum_i \sigma_i^2 + 2$$
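The two spectral facts used here are easy to confirm numerically. The snippet below is a small sanity check with arbitrarily chosen values ($n = 6$ and the pair $(a, b) = (1, 4)$ are our own example, not taken from the text), using the standard identity $L_{K_n} = nI - 11^T$ for the Laplacian of the complete graph.

```python
import numpy as np

n, a, b = 6, 1, 4

# Laplacian of the complete graph K_n: degree n - 1 on the diagonal, -1 elsewhere.
L_Kn = n * np.eye(n) - np.ones((n, n))

# L_{a,b} = e_{a,b} e_{a,b}^T with e_{a,b} = e_a - e_b.
e_ab = np.zeros(n)
e_ab[a], e_ab[b] = 1.0, -1.0
L_ab = np.outer(e_ab, e_ab)

eig_Kn = np.sort(np.linalg.eigvalsh(L_Kn))   # expected: one 0 and n - 1 copies of n
eig_ab = np.sort(np.linalg.eigvalsh(L_ab))   # expected: n - 1 zeros and a single 2
trace_ab = float(np.trace(L_ab))             # equals the sum of eigenvalues, 2
```

Since the trace is the sum of the eigenvalues, adding $(1 - \frac{w}{n}) L_{a,b}$ to $L_G$ can raise the total of the singular values by at most 2, which is the inequality displayed above.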

Another fact we can deduce from $L_G \preceq L_{G'}$ is the following. It is an immediate
application of Fact 2.1.

Fact 5.10. Since the kernels of $L_G$ and of $L_{G'}$ are identical, then for every $x$ it
holds that $x^T L_{G'}^\dagger x \le x^T L_G^\dagger x$.

Having  established  the  above  facts,  we  can  turn  to  the  proof  of  privacy. 

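Proof of Claim 5.7. We first prove the upper bound in (5.1). As mentioned, we focus only
on $x \in V = 1^\perp$, where

$$\mathrm{PDF}_{E_G^T Y}(x) = \left((2\pi)^{n-1}\, \widetilde{\det}(L_G)\right)^{-1/2} \exp\left(-\tfrac{1}{2} x^T L_G^\dagger x\right)$$

$$\mathrm{PDF}_{E_{G'}^T Y}(x) = \left((2\pi)^{n-1}\, \widetilde{\det}(L_{G'})\right)^{-1/2} \exp\left(-\tfrac{1}{2} x^T L_{G'}^\dagger x\right)$$

As noted above, we have that for every $x$ it holds that
$x^T L_{G'}^\dagger x \le x^T L_G^\dagger x$, so
$\exp(-\frac{1}{2} x^T L_G^\dagger x) \le \exp(-\frac{1}{2} x^T L_{G'}^\dagger x)$. It follows
that for every $x$ we have that

$$\frac{\mathrm{PDF}_{E_G^T Y}(x)}{\mathrm{PDF}_{E_{G'}^T Y}(x)} \le \left(\frac{\widetilde{\det}(L_{G'})}{\widetilde{\det}(L_G)}\right)^{1/2} = \left(\prod_{i=1}^{n-1} \frac{\lambda_i^2}{\sigma_i^2}\right)^{1/2}$$

Denoting $\Delta_i = \lambda_i^2 - \sigma_i^2 \ge 0$, and recalling that
$\sum_i \Delta_i \le 2$ and that $\forall i$, $\sigma_i^2 \ge w$, it holds that

$$\frac{\mathrm{PDF}_{E_G^T Y}(x)}{\mathrm{PDF}_{E_{G'}^T Y}(x)} \le \sqrt{\prod_i \left(1 + \frac{\Delta_i}{\sigma_i^2}\right)} \le \exp\left(\frac{1}{2w} \sum_i \Delta_i\right) \le e^{1/w} \le e^{\epsilon/\sqrt{4r\ln(2/\delta)}} = e^{\epsilon_0}$$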


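We now turn to the lower bound of (5.2). We start with analyzing the term
$x^T L_G^\dagger x$ that appears in $\mathrm{PDF}_{E_G^T Y}(x)$. Again, we emphasize that
$x \in V$, justifying the very first equality below.

$$x^T L_G^\dagger x = x^T L_G^\dagger L_{G'} L_{G'}^\dagger x = x^T L_G^\dagger \left(L_G + \left(1 - \frac{w}{n}\right) L_{a,b}\right) L_{G'}^\dagger x$$

$$= x^T L_{G'}^\dagger x + \left(1 - \frac{w}{n}\right) x^T L_G^\dagger L_{a,b} L_{G'}^\dagger x$$

$$= x^T L_{G'}^\dagger x + \left(1 - \frac{w}{n}\right) x^T L_G^\dagger e_{a,b} \cdot e_{a,b}^T L_{G'}^\dagger x$$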

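Therefore, if we show that

$$\Pr_x\left[x^T L_G^\dagger e_{a,b} \cdot e_{a,b}^T L_{G'}^\dagger x > 2\epsilon_0\right] < \delta_0 \qquad (5.3)$$

then it holds that w.p. $\ge 1 - \delta_0$ we have

$$\frac{\mathrm{PDF}_{E_G^T Y}(x)}{\mathrm{PDF}_{E_{G'}^T Y}(x)} \ge \exp\left(-\frac{1}{2} x^T \left(L_G^\dagger - L_{G'}^\dagger\right) x\right) = \exp\left(-\frac{1 - \frac{w}{n}}{2}\, x^T L_G^\dagger e_{a,b} \cdot e_{a,b}^T L_{G'}^\dagger x\right) \ge e^{-\epsilon_0}$$

which proves the lower bound of (5.2). We turn to proving (5.3).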

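Denote $term_1 = e_{a,b}^T L_G^\dagger x$ and $term_2 = e_{a,b}^T L_{G'}^\dagger x$. Since
$x = E_G^T y$ where $y \sim Y$, then $term_i$ is distributed like $vec_i^T Y$ where
$vec_1 = E_G L_G^\dagger e_{a,b}$ and $vec_2 = E_G L_{G'}^\dagger e_{a,b}$. The naive bound,
$\|vec_1\| \le \|E_G\| \|L_G^\dagger\| \|e_{a,b}\|$, gives a bound on the size of $vec_1$ which
is dependent on the ratio $\sigma_1/\sigma_{n-1}^2$. We can improve the bound, on both
$\|vec_1\|$ and $\|vec_2\|$, using the SVD of $E_G$ and $E_{G'}$.

$$\|vec_1\| = \|E_G L_G^\dagger e_{a,b}\| = \|V \Sigma U^T U \Sigma^{-2} U^T e_{a,b}\| = \|V \Sigma^{-1} U^T e_{a,b}\|$$

$$\le \|V\| \|\Sigma^{-1}\| \|U\| \|e_{a,b}\| = 1 \cdot \sigma_{n-1}^{-1} \cdot 1 \cdot \sqrt{2} \le \frac{\sqrt{2}}{\sqrt{w}}$$

$$\|vec_2\| = \|E_G L_{G'}^\dagger e_{a,b}\| = \left\|\left(E_{G'} - \left(1 - \sqrt{w/n}\right) E_{a,b}\right) L_{G'}^\dagger e_{a,b}\right\| \le \|E_{G'} L_{G'}^\dagger e_{a,b}\| + \|E_{a,b} L_{G'}^\dagger e_{a,b}\|$$

$$\overset{(*)}{\le} \lambda_{n-1}^{-1} \cdot \sqrt{2} + \|E_{a,b} L_{G'}^\dagger e_{a,b}\| \overset{(**)}{=} \frac{\sqrt{2}}{\lambda_{n-1}} + e_{a,b}^T L_{G'}^\dagger e_{a,b}$$

$$\le \frac{\sqrt{2}}{\sqrt{w}} + \frac{2}{w} = \frac{\sqrt{2}}{\sqrt{w}}\left(1 + \frac{\sqrt{2}}{\sqrt{w}}\right)$$

where the bound in (*) is derived just like in $vec_1$ (using
$E_{G'} L_{G'}^\dagger e_{a,b} = V' \Lambda U'^T U' \Lambda^{-2} U'^T e_{a,b}$), and the equality
in (**) follows from the fact that all coordinates in the vector
$E_{a,b} L_{G'}^\dagger e_{a,b}$ are zero, except for the coordinate indexed by the $(a, b)$ pair.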

We now use the fact that $term_1$ and $term_2$ are both linear combinations of i.i.d.
$\mathcal{N}(0, 1)$ random variables. Therefore for $i = 1, 2$ we have that
$term_i \sim \mathcal{N}(0, \|vec_i\|^2)$, so
$\Pr[|term_i| > \sqrt{\log(2/\delta_0)}\,\|vec_i\|] \le \delta_0/2$. It follows that w.p.
$\ge 1 - \delta_0$ both $|term_1| \le \sqrt{\log(2/\delta_0)} \cdot \frac{\sqrt{2}}{\sqrt{w}}$
and $|term_2| \le \sqrt{\log(2/\delta_0)} \cdot \frac{2}{\sqrt{w}}$, so
$term_1 \cdot term_2 \le \sqrt{8}\log(2/\delta_0)/w$. Plugging in the value of $w$, we have
that $\Pr[term_1 \cdot term_2 \le 2\epsilon_0] \ge 1 - \delta_0$, which concludes the proof of
(5.3) and of Claim 5.7. □
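For completeness, the final substitution can be spelled out. The following worked step assumes the parameter settings as we read them from Algorithm 2 and Claim 5.7, namely $w = \frac{\sqrt{32 r \ln(2/\delta)}}{\epsilon}\ln(4r/\delta)$, $\epsilon_0 = \frac{\epsilon}{\sqrt{4r\ln(2/\delta)}}$ and $\delta_0 = \frac{\delta}{2r}$, and takes the logarithm to be natural; it is a sketch of the arithmetic rather than a quotation of the original derivation.

```latex
% Since \delta_0 = \delta/(2r), we have 2/\delta_0 = 4r/\delta, and therefore
\[
  \frac{\sqrt{8}\,\ln(2/\delta_0)}{w}
  = \sqrt{8}\,\ln(4r/\delta)\cdot
    \frac{\epsilon}{\sqrt{32\, r \ln(2/\delta)}\;\ln(4r/\delta)}
  = \frac{\epsilon}{2\sqrt{r \ln(2/\delta)}}
  = \frac{\epsilon}{\sqrt{4 r \ln(2/\delta)}}
  = \epsilon_0 \;\le\; 2\epsilon_0 .
\]
```

Under these settings the product of the two terms is in fact bounded by $\epsilon_0$, so the threshold $2\epsilon_0$ is met with room to spare.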


5.3.2  Discussion  and  Comparison  with  Other  Algorithms 

Recently, Gupta et al. [77] have also considered the problem of answering cut-queries
while preserving differential privacy, examining both an iterative database construction
approach (e.g., based on the multiplicative-weights method) and a randomized-response
approach. Here, we compare this and other methods to our algorithm. We compare them
along several axes: the dependence on $n$ and $s$ (number of vertices in $G$ and in $S$
resp.), the dependence on $\epsilon$, and the dependence on $k$ - the number of queries
answered by the mechanism. Other parameters are omitted. The bottom line is that for a long
non-adaptive query sequence, our approach dominates in the case that $s = o(n)$. The results
are summarized in Table 5.1.

Note, comparing the dependence on $k$ for interactive and non-interactive mechanisms
is not straightforward. In general, non-interactive mechanisms are more desirable than
interactive mechanisms, because interactive mechanisms require a central authority that
serves as the only way users can interact with the database. However, interactive
mechanisms can answer $k$ adaptively chosen queries. In order for non-interactive mechanisms
to do so, they have to answer correctly on $\min\{\exp(O(k)), 2^n\}$ queries. This is why
outputting a sanitized database is often considered a harder task than interactively answering
user queries. We therefore compare answering $k$ adaptively chosen queries for interactive
mechanisms, and $k$ predetermined queries for non-interactive mechanisms.
5.3.2.1 Naively Adding Laplace Noise

The most basic of all differentially private mechanisms is the classical Laplace mechanism,
which is interactive. For a given $\epsilon$, a user poses a cut-query $S$ and the mechanism
replies with $\Phi_G(S) + Lap(0, \epsilon^{-1})$ (since the global sensitivity of cut-queries
is 1). The composition theorem of [65] assures us that for $k$ queries we preserve
$(O(\sqrt{k}\epsilon), \delta)$-privacy. As a result, if we wish to preserve
$(\epsilon, \delta)$-differential privacy for a given set of $k$ cut queries, we must answer
each query with a privacy parameter of roughly $\epsilon/\sqrt{k}$, and hence with additive
error of roughly $\sqrt{k}/\epsilon$. So the mechanism completely obfuscates the true answer
if $k > n^4$.
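A minimal sketch of this baseline follows. It is only an illustration of the scaling argument: the function name is hypothetical, and the per-query parameter $\epsilon/\sqrt{k}$ ignores the constants and the $\log(1/\delta)$ factor that the composition theorem actually requires.

```python
import numpy as np

def laplace_cut_answers(true_cuts, eps, rng):
    """Answer k cut queries (sensitivity 1) with Laplace noise of scale sqrt(k)/eps.

    The per-query privacy parameter eps / sqrt(k) is a simplification that drops
    the constants and the log(1/delta) factor of the composition theorem.
    """
    true_cuts = np.asarray(true_cuts, dtype=float)
    k = true_cuts.shape[0]
    eps_per_query = eps / np.sqrt(k)
    noise = rng.laplace(loc=0.0, scale=1.0 / eps_per_query, size=k)
    return true_cuts + noise, eps_per_query

rng = np.random.default_rng(5)
answers, eps_q = laplace_cut_answers(np.zeros(10000), eps=1.0, rng=rng)
mean_abs_error = float(np.mean(np.abs(answers)))   # concentrates around sqrt(k)/eps = 100
```

The expected absolute value of a Laplace variable equals its scale, so the typical error per query grows like $\sqrt{k}/\epsilon$ as more queries are answered.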

5.3.2.2  The  Randomized  Response  Mechanism 

The “Randomized Response” algorithm perturbs the edges of a graph in a way that allows
us to publish the result and still preserve privacy. Given $G$, the Randomized Response
algorithm constructs a weighted graph $H$ where for every $u, v \in V(G)$, the weight of the
edge $(u, v)$ in $H$, denoted $w'_{u,v}$, is either 1 or $-1$. Each edge picks its weight
independently, s.t. $\Pr[w'_{u,v} = 1] = \frac{1 + \epsilon w_{u,v}}{2}$ and
$\Pr[w'_{u,v} = -1] = \frac{1 - \epsilon w_{u,v}}{2}$.
Clearly, this algorithm maintains $\epsilon$-differential edge privacy: two neighboring graphs
differ on a single edge, $(a, b)$, and obviously

$$\Pr[w'_{a,b} = 1 \mid w_{a,b} = 1] \le (1 + \epsilon) \Pr[w'_{a,b} = 1 \mid w_{a,b} = 0]$$

In  addition,  it  is  also  evident  that  for  every  nonempty  S  C  V(G),  we  have  that  E  Y)ug5.  veS  w'uv\  = 
e  Yhu&Svis  Wv-,v  —  £®g(S),  yet  the  variance  of  this  r.v.  is  Q(s(n  —  s)).  Therefore,  a  clas¬ 
sical  Hoeffding-type  bound  gives  that  for  any  nonempty  S  C  V ( G )  we  have  that  for  every 

0  <  v  <  1/2, 

Pr  J  E 

u£S,v£S 

Observe  that  while  \Js(n  —  s)  is  a  comparable  with  s  when  s  =  O(n),  there  are  cuts 

(namely,  cuts  with  s  =  0(1))  where  =  O ( \/n) .  More  generally,  the  additive 

noise  of  Randomized  Response  is  a  factor  yjn/s  worse  than  our  algorithm.  We  comment 
that  the  Randomized  Response  algorithm  can  also  be  performed  in  a  distributed  fashion, 
and  in  contrast  to  our  algorithm,  it  has  no  multiplicative  error.  In  addition,  the  above 
analysis  holds  for  any  linear  combination  of  edge,  not  just  the  s(n  —  s )  potential  edges 
that  cross  the  (S,  S)  cut.  So  given  E'  C  E(G)  it  is  possible  to  approximate  we  up 
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to  ±  ^  w.p.  >1  —  2 v.  In  particular,  for  queries  regarding  an  (S,  T)- cut  (where 

S,  T  are  two  disjoint  subsets  of  vertices)  we  can  estimate  the  error  up  to  ±  v/lgHrlk’s(-1/id _ 
We  also  comment  that  the  version  of  Randomized  Response  presented  here  differs  slightly 
from  the  version  of  [77].  In  particular,  it  is  possible  to  address  their  concern  regarding 
outputting  a  sanitized  graph  with  non-negative  weights  by  an  affine  transformation  taking 
{-1,1} ->{0,1}. 
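
A minimal Python sketch of Randomized Response on an unweighted graph follows. It assumes vertices are labelled $0, \ldots, n-1$ and edge weights are in $\{0, 1\}$; the function names are hypothetical and serve only to illustrate the sampling rule and the rescaling by $1/\epsilon$.

```python
import random


def randomized_response(n, edges, epsilon, rng):
    """Return noisy weights w' in {-1, +1} with E[w'] = epsilon * w for every pair."""
    edge_set = {(min(u, v), max(u, v)) for u, v in edges}
    noisy = {}
    for u in range(n):
        for v in range(u + 1, n):
            w = 1 if (u, v) in edge_set else 0
            p_plus = (1 + epsilon * w) / 2      # Pr[w' = +1]
            noisy[(u, v)] = 1 if rng.random() < p_plus else -1
    return noisy


def estimate_cut(noisy, subset, epsilon):
    """Unbiased cut estimate: sum the noisy weights across the cut, divide by epsilon."""
    subset = set(subset)
    total = sum(w for (u, v), w in noisy.items() if (u in subset) != (v in subset))
    return total / epsilon
```

Every one of the $s(n - s)$ pairs across the cut contributes noise, whether or not it is an edge, which is why the error scales with $\sqrt{s(n - s)}$ rather than with the cut value.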


5.3.2.3  Exponential  Mechanism  /  BLR 

The exponential mechanism [112, 43] is a non-interactive privacy preserving mechanism,
which is typically intractable. To implement it for cut-queries one needs to (a) specify a
range of potential outputs and (b) give a scoring function over potential outputs s.t. a good
output’s score is much higher than all bad outputs’ scores.

One such set of potential outputs is derived from edge-sparsifiers. Given a graph $G$
we say that $H$ is an edge-sparsifier for $G$ if for any nonempty $S \subset V(G)$ it holds that
$\Phi_H(S) \in (1 \pm \eta)\Phi_G(S)$. There’s a rich literature on sparsifiers (see [38, 133, 132]), and
the current best known construction [33] gives a (weighted) sparsifier with $O(n/\eta^2)$ edges
with all edge-weights $\le \mathrm{poly}(n)$. By describing every edge’s two endpoints and weight,
we have that such edge-sparsifiers can be described using $O(n\log(n))$ bits (omitting dependence
on $\eta$). Thus, the number of sparsifiers is bounded above by $\exp(O(n\log(n)))$.
Given an input graph $G$ and a weighted graph $H$, we can score $H$ using $q(G, H) =
\max_S \left\{\min_{a : |a - 1| \le \eta} |\Phi_H(S)/a - \Phi_G(S)|\right\}$. Observe that if we change $G$ to a neighboring
graph $G'$, then the score changes by at most 1.

Putting it all together, we have that given input $G$ the exponential mechanism gives
a score of $e^{-\epsilon q(G,H)/2}$ to each possible output. The edge-sparsifier of $G$ gets a score of 1,
whereas every graph with $q(G, H) > \tau$ gets a score of at most $e^{-\epsilon\tau/2}$. So if we wish to claim we
output a graph whose error is $> \tau$ w.p. at most $\nu$, then we need to set $\exp(n\log(n) -
\epsilon\tau/2) \le \nu$. It follows that $\tau$ is proportional to $n\log(n)/\epsilon$. Note however that the additive
error of this mechanism is independent of the number of queries it answers correctly.
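
The counting argument above can be illustrated on a small, explicitly enumerated candidate set. The sketch below is an assumption-laden toy: the candidates are abstract indices with given error scores $q$ (lower is better, sensitivity 1), not actual sparsifiers, and the function names are hypothetical.

```python
import math
import random


def exponential_mechanism(scores, epsilon, rng):
    """Sample an index with probability proportional to exp(-epsilon * q / 2)."""
    best = min(scores)
    # Shifting by the best score leaves the distribution unchanged and avoids underflow.
    weights = [math.exp(-epsilon * (q - best) / 2) for q in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def prob_error_at_least(scores, epsilon, tau):
    """Exact probability that the sampled candidate has error q >= tau."""
    best = min(scores)
    weights = [math.exp(-epsilon * (q - best) / 2) for q in scores]
    bad = sum(w for q, w in zip(scores, weights) if q >= tau)
    return bad / sum(weights)
```

With one good candidate and 1000 candidates of error 10 at $\epsilon = 1$, the bad candidates still carry most of the probability mass; only when the error threshold exceeds roughly $2\ln(\text{range size})/\epsilon$ does their total mass become negligible, mirroring the requirement $\exp(n\log(n) - \epsilon\tau/2) \le \nu$.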

We comment that even though we managed to find a range of size $2^{O(n\log(n))}$, it is
possible to show that the range of the mechanism has to be of size $2^{\Omega(n)}$. (Fix $\alpha < 1/2$ and
think of a set of inputs $\mathcal{G}$ where each $G \in \mathcal{G}$ has $n/2$ vertices with degree $n^{\alpha}$ and $n/2$
vertices with degree $n^{2\alpha}$. Preserving all cuts of size 1 up to $(1 \pm \eta)$ requires our output
to have vertices of degree $\ge (1 - \eta)n^{2\alpha}$ and vertices of degree $\le (1 + \eta)n^{\alpha}$. Therefore,
by representing vertices of high- and low-degree using a binary vector, there exists an
injective mapping of balanced $\{0, 1\}^n$-vectors into the set of potential outputs.) Thus,
unless one can devise a scoring function of lower sensitivity, the exponential mechanism
must have additive error proportional to $n/\epsilon$.


5.3.2.4  The  Multiplicative  Weights  Mechanism 

The very elegant Multiplicative Weights mechanism of Hardt and Rothblum [80] can be
adapted as well for answering cut queries. In the Multiplicative Weights mechanism, a
database is represented by a histogram over all $N$ “types” of individuals that exist in a
certain universe. In our case, each pair of vertices is a type, and each entry in the database
is an edge detailing its weight. Thus, $N = \binom{n}{2}$ and the database length is $|E|$ (see footnote 3), and each
query $S$ corresponds to taking a dot-product between this histogram and the $\binom{n}{2}$-length binary
vector indicating the edges that cross the cut. Plugging these parameters into the main
theorem of [80], we get an adaptive mechanism that answers $k$ queries with additive noise
of $\tilde{O}(\sqrt{|E|}\log(k)/\epsilon)$.

We should mention that the Multiplicative Weights mechanism, in contrast to ours, always
answers correctly with no multiplicative error and can deal with $k$ adaptively chosen
queries. Furthermore, it allows one to answer any linear query on the edges, not just
cut-queries, and in particular answer $(S, T)$-cut queries. However, its additive error is bigger
than ours, and should we choose to set $k = 2^n$ (meaning, answering all cut-queries) then
its additive error becomes $O(n\sqrt{|E|}/\epsilon)$ (in contrast to our $O(s\sqrt{n}/\epsilon)$).

Gupta et al. [77] have improved on the bounds of the Multiplicative Weights mechanism
by generalizing it as an “Iterative Database Construction” mechanism, and providing
a tighter analysis of it. In particular, they have reduced the dependency on $\epsilon$ to $1/\sqrt{\epsilon}$.
Overall, their additive error is $O(\sqrt{|E|\log(k)/\epsilon})$, which for the case of all cut-queries
is $O(\sqrt{n|E|/\epsilon})$.


5.3.2.5  Our  Algorithm 

Clearly, our algorithm is non-interactive. As such, if we wish to answer correctly w.h.p.
a set of $k$ predetermined queries, we set $\nu' = \nu/k$, and deduce that the amount of noise
added to each query is $O(s\sqrt{\log(k)}/\epsilon)$. So, if we wish to answer all $2^n$ cut queries correctly,
our noise is set to $O(s\sqrt{n}/\epsilon)$. An interesting observation is that when
we aim to answer all $2^n$ queries, we generate an iid normal matrix of size $r \times n$ where
$r > n$. Therefore, we now apply the JL transform to increase the dimensionality of the
problem rather than decreasing it. This clearly sets privacy preserving apart from all other
applications of the JL transform.

3 Observe that it is not possible to assume $|E| = O(n)$ using sparsifiers, because sparsifiers output a
weighted graph with edge-weights $O(n)$. Since the Multiplicative Weights mechanism views the database
as a histogram, the overall resolution of the problem remains roughly $n^2$ in the worst case.

Method | Additive error for any $k$ | Additive error for all cuts | Multiplicative error? | Interactive? | Tractable? | Comments
Laplace Noise | $O(\sqrt{k}/\epsilon)$ | $O(2^{n/2}/\epsilon)$ | ✗ | ✓ | ✓ |
Randomized Response | $O(\sqrt{sn\log(k)}/\epsilon)$ | $O(n\sqrt{s}/\epsilon)$ | ✗ | ✗ | ✓ | Can be distributed; answers $(S, T)$-cut queries
Exponential Mechanism | $O(n\log(n)/\epsilon)$ | $O(n\log(n)/\epsilon)$ | ✓ | ✗ | ✗ | Error ind. of $k$
MW | $O(\sqrt{|E|}\log(k)/\epsilon)$ | $O(n\sqrt{|E|}/\epsilon)$ | ✗ | ✓ | ✓ | Answers $(S, T)$-cut queries
IDC | $O(\sqrt{|E|\log(k)/\epsilon})$ | $O(\sqrt{n|E|/\epsilon})$ | ✗ | ✓ | ✓ | Answers $(S, T)$-cut queries
JL | $O(s\sqrt{\log(k)}/\epsilon)$ | $O(s\sqrt{n}/\epsilon)$ | ✓ | ✗ | ✓ | Can be distributed

Table 5.1: Comparison between mechanisms for answering cut-queries. $\epsilon$ - privacy parameter;
$n$ and $|E|$ - number of vertices and edges resp.; $s$ - number of vertices in a query;
$k$ - number of queries.
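
To get a feel for the all-cuts column of Table 5.1, the bounds can be evaluated numerically. The sketch below drops all constants and logarithmic factors hidden in the $O(\cdot)$ notation, so the numbers are only indicative; the function name is hypothetical and the IDC row is left out.

```python
import math


def all_cuts_error_bounds(n, num_edges, s, epsilon):
    """Evaluate the all-cuts bounds of Table 5.1 with every hidden constant set to 1."""
    return {
        "Laplace noise": 2 ** (n / 2) / epsilon,      # overflows for very large n
        "Randomized Response": n * math.sqrt(s) / epsilon,
        "Exponential mechanism": n * math.log(n) / epsilon,
        "MW": n * math.sqrt(num_edges) / epsilon,
        "JL": s * math.sqrt(n) / epsilon,
    }


bounds = all_cuts_error_bounds(n=100, num_edges=1000, s=4, epsilon=1.0)
for method, value in sorted(bounds.items(), key=lambda item: item[1]):
    print(f"{method:>22}: {value:.3g}")
```

For small query sets $S$ the JL bound is the smallest of the five, and its ratio to the Randomized Response bound is exactly $\sqrt{n/s}$, as stated in the text.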

In addition, we comment that our algorithm can be implemented in a distributed fashion,
where node $i$ repeats the following procedure $r$ times (where $r$ is the number of rows
in the matrix picked by Algorithm 2): First, $i$ picks $n - i - 1$ iid samples from $\mathcal{N}(0, 1)$
and sends the $j$-th sample, $x_j$, to node $i + j$. Once node $i$ receives $i - 1$ values from nodes
$1, 2, \ldots, i - 1$, it outputs the weighted sum $\sum_{j \ne i} (-1)^{\mathbb{1}\{j < i\}} x_j \left(\sqrt{w/n} + w_{i,j}\left(1 - \sqrt{w/n}\right)\right)$ (where
$(-1)^{\mathbb{1}\{j < i\}}$ denotes $-1$ if $j < i$, or $1$ otherwise).


5.4  Publishing  a  Covariance  Matrix 

5.4.1  The  Algorithm 

In this section, we are concerned with the question of allowing users to estimate the covariance
of a given sample data along an arbitrary direction $x$. We think of our input as an
$n \times d$ matrix $A$, and we maintain privacy w.r.t. changing the coordinates of a single row
s.t. a vector $v$ of size 1 is added to it. We now detail our algorithm for publishing the
covariance matrix of $A$. Observe that in addition to the variance, we can output $\mu = \frac{1}{n}A^T\mathbf{1}$,
the mean of all samples in $A$, in a differentially private manner by adding random Gaussian
noise. (We merely output $\tilde{\mu} = \mu + \mathcal{N}(0, \sigma^2 I_{d \times d})$ for a suitably chosen variance $\sigma^2$.) We denote by $I_{n \times d}$ the $n \times d$
matrix whose main diagonal has 1 in each coordinate and all other coordinates are 0.

Algorithm 4: Outputting a Covariance Matrix while Preserving Differential Privacy
Input: An $n \times d$ matrix $A$. Parameters $\epsilon, \delta, \eta, \nu > 0$.

1 Set $r = \frac{8\ln(2/\nu)}{\eta^2}$ and $w = \frac{16\sqrt{r\ln(2/\delta)}}{\epsilon}\ln(16r/\delta)$.
2 Subtract the mean from $A$ by computing $A \leftarrow A - \frac{1}{n}\mathbf{1}\mathbf{1}^TA$.
3 Compute the SVD of $A = U\Sigma V^T$.
4 Set $A \leftarrow U\left(\sqrt{\Sigma^2 + w^2I_{n \times d}}\right)V^T$.
5 Pick a matrix $M$ of size $r \times n$ whose entries are iid samples of $\mathcal{N}(0, 1)$.
6 return $C = \frac{1}{r}A^TM^TMA$.


Algorithm 5: Approximating $\Phi_A(x)$
Input: A unit-length vector $x$, parameter $w$ and a covariance matrix $C$ from
Algorithm 4.

return $R(x) = x^TCx - w^2$.
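
The two algorithms can be sketched in plain Python as follows. This is an illustration, not the procedure exactly as stated: $r$ and $w$ are taken as inputs rather than computed from $\epsilon, \delta, \eta, \nu$, and instead of performing the SVD of steps 3-4 the sketch uses a sampling shortcut of its own. For $n \ge d$, the shifted matrix satisfies $A^TA \mapsto A^TA + w^2I$, so each row of $MA$ is distributed as $\mathcal{N}(0, A^TA + w^2I)$, which is also the distribution of $A^Tm + wz$ for independent standard normal $m \in \mathbb{R}^n$ and $z \in \mathbb{R}^d$. All function names are hypothetical.

```python
import math
import random


def center_columns(a):
    """Step 2: subtract the mean row from every row of the n x d matrix."""
    n, d = len(a), len(a[0])
    means = [sum(row[j] for row in a) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in a]


def publish_covariance(a, r, w, rng):
    """Return C = (1/r) * sum_k g_k g_k^T with g_k ~ N(0, A^T A + w^2 I).

    Assumption: g_k is sampled as A^T m_k + w * z_k, a shortcut that matches the
    distribution of a row of M A after the singular-value shift of step 4.
    """
    a = center_columns(a)
    n, d = len(a), len(a[0])
    c = [[0.0] * d for _ in range(d)]
    for _ in range(r):
        m = [rng.gauss(0.0, 1.0) for _ in range(n)]
        g = [sum(m[i] * a[i][j] for i in range(n)) + w * rng.gauss(0.0, 1.0)
             for j in range(d)]
        for p in range(d):
            for q in range(d):
                c[p][q] += g[p] * g[q] / r
    return c


def directional_variance(c, x, w):
    """Algorithm 5: R(x) = x^T C x - w^2 for a unit-length vector x."""
    d = len(x)
    return sum(x[p] * c[p][q] * x[q] for p in range(d) for q in range(d)) - w * w
```

Subtracting $w^2$ in `directional_variance` removes the expected contribution of the shift, so the answer concentrates around $\|Ax\|^2$ with a spread proportional to $(\|Ax\|^2 + w^2)/\sqrt{r}$.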

Theorem 5.11. Algorithm 4 preserves $(\epsilon, \delta)$-differential privacy.

Theorem 5.12. Algorithm 5 is an $(\eta, \tau, \nu)$-approximation for directional variance queries,
where $\tau = O\left(\frac{\ln(1/\delta)\ln(1/\nu)}{\epsilon^2\eta}\ln^2\left(\frac{\ln(1/\nu)}{\delta\eta^2}\right)\right)$.

Proof of Theorem 5.12. Again, the proof is immediate from the JL Lemma, and straightforward
arithmetic gives that for every $x$, w.p. $\ge 1 - \nu$ we have that

$$(1 - \eta)\Phi_A(x) - \eta w^2 \le R(x) \le (1 + \eta)\Phi_A(x) + \eta w^2,$$

so $\tau = \eta w^2$. □

Comment. We wish to clarify that Theorem 5.12 does not mean that we publish a matrix
$C$ which is a low-rank approximation to $A^TA$. It is also not a matrix on which one can
compute an approximated PCA of $A$, even if we set $\nu = 1/\mathrm{poly}(d)$. The matrix $C$ should
be thought of as a “test-matrix” - if you believe $A$ has high directional variance along some
direction $x$ then you can test your hypothesis on $C$ and (w.h.p.) get a good approximate
answer. However, we do not guarantee that the singular values of $A^TA$ and of $C$ are close
or that the eigenvectors of $A^TA$ and $C$ are comparable. (See discussion in Section 5.5.)

Proof of Theorem 5.11. Fix two neighboring $A$ and $A'$. We often refer to the gap matrix
$A' - A$ as $E$. Observe, $E$ is a rank-1 matrix, which we denote as the outer-product $E = e_iv^T$
($e_i$ is the indicator vector of row $i$ and $v$ is a vector of norm 1). As such, the singular values
of $E$ are exactly $\{1, 0, \ldots, 0\}$ (see footnote 4).

The proof of the theorem is composed of two stages. The first stage is the simpler one.
We ignore step 4 of Algorithm 4 (shifting the singular values), and work under the premise
that both $A$ and $A'$ have singular values no less than $w$. In the second stage we denote $B$
and $B'$ as the results of applying step 4 to $A$ and $A'$ resp., and show what adaptations are
needed to make the proof follow through.


Stage 1. We assume step 4 was not applied, and all singular values of $A$ and $A'$ are at
least $w$.

As in the proof of Theorem 5.4, the proof follows from the assumption that Algorithm 4
outputs $O^T = A^TM^T$ (which clearly allows us to reconstruct $C = \frac{1}{r}O^TO$). Again, $O^T$ is
composed of $r$ columns, each of which is an iid sample from $A^TY$ where $Y \sim \mathcal{N}(0, I_{n \times n})$. We now
give the analogous claim to Claim 5.7.

Claim 5.13. Fix $\epsilon_0 = \frac{\epsilon}{\sqrt{4r\ln(2/\delta)}}$ and $\delta_0 = \frac{\delta}{2r}$. Denote
$S = \{x : e^{-\epsilon_0}\mathrm{PDF}_{A'^TY}(x) \le \mathrm{PDF}_{A^TY}(x) \le e^{\epsilon_0}\mathrm{PDF}_{A'^TY}(x)\}$. Then $\Pr[S] \ge 1 - \delta_0$.

Again, the composition theorem of [65] along with the choice of $r$ gives that overall
we preserve $(\epsilon, \delta)$-differential privacy. □


Proof of Claim 5.13. The proof mimics the proof of Claim 5.7, but there are two subtle
differences. First, the problem is simpler notation-wise, because $A$ and $A'$ both have full
rank due to Algorithm 4. Secondly, the problem becomes more complicated and requires
that we use some heavier machinery, because the singular values of $A'$ aren’t necessarily bigger
than the singular values of $A$. Details follow.

First, let us formally define the PDF of the two distributions. Again, we apply the fact
that $A^TY$ and $A'^TY$ are linear transformations of $\mathcal{N}(0, I_{n \times n})$.

$$\mathrm{PDF}_{A^TY}(x) = \frac{1}{\sqrt{(2\pi)^d\det(A^TA)}}\exp\left(-\frac{1}{2}x^T(A^TA)^{-1}x\right)$$

$$\mathrm{PDF}_{A'^TY}(x) = \frac{1}{\sqrt{(2\pi)^d\det(A'^TA')}}\exp\left(-\frac{1}{2}x^T(A'^TA')^{-1}x\right)$$

Our proof proceeds as follows. First, we show

$$e^{-\epsilon_0/2} \le \sqrt{\frac{\det(A'^TA')}{\det(A^TA)}} \le e^{\epsilon_0/2} \qquad (5.4)$$

Then we show that no matter whether we sample $x$ from $A^TY$ or from $A'^TY$, we have that

$$\Pr_x\left[\frac{1}{2}\left|x^T\left((A^TA)^{-1} - (A'^TA')^{-1}\right)x\right| \ge \epsilon_0/2\right] \le \delta_0 \qquad (5.5)$$

Clearly, combining both (5.4) and (5.5) proves the claim.

4 For convenience, we ignore the part of the algorithm that subtracts the mean of the rows of $A$. Observe
that if $E = A' - A$ then after subtracting the mean from each row, the difference between the two matrices
is $e'_iv^T$ where $e'_i$ is obtained by subtracting $1/n$ from each coordinate of $e_i$. Since $\|e'_i\| \le \|e_i\|$, this has no effect
on the analysis.

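Let us prove (5.4). Denote the SVD of $A = U\Sigma V^T$ and $A' = U'\Lambda V'^T$, where the
singular values of $A$ are $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_d > 0$ and the singular values of $A'$ are
$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d > 0$. Therefore we have $A^TA = V\Sigma^2V^T$, $A'^TA' = V'\Lambda^2V'^T$
and also $(A^TA)^{-1} = V\Sigma^{-2}V^T$, $(A'^TA')^{-1} = V'\Lambda^{-2}V'^T$. Thus $\det(A^TA) = \prod_{i=1}^d \sigma_i^2$ and
$\det(A'^TA') = \prod_{i=1}^d \lambda_i^2$.

This time, in order to bound the gap $(\lambda_i^2 - \sigma_i^2)/\sigma_i^2$ it isn’t sufficient to use the trace
of the matrices. Instead, we invoke an application of Lindskii’s theorem (Theorem 2.4).

Fact 5.14 (Lindskii). For every $k$ and every $1 \le i_1 < i_2 < \cdots < i_k \le n$ we have that

$$\sum_{j=1}^k \lambda_{i_j} \le \sum_{j=1}^k \sigma_{i_j} + \sum_{i=1}^k sv_i(E)$$

where $\{sv_i(E)\}_{i=1}^n$ are the singular values of $E$ sorted in a descending order.

As a corollary, because $E$ has only 1 non-zero singular value, we denote $Big = \{i :
\lambda_i > \sigma_i\}$ and deduce that $\sum_{i \in Big}(\lambda_i - \sigma_i) \le 1$. Similarly, since the singular values of $E$
and of $(-E)$ are the same, we have that $\sum_{i \notin Big}(\sigma_i - \lambda_i) \le 1$. Using this, proving (5.4) is
straightforward. Since all $\sigma_i \ge w$ and $1 + t \le e^t$,

$$\sqrt{\prod_i \frac{\lambda_i^2}{\sigma_i^2}} \le \prod_{i \in Big}\left(1 + \frac{\lambda_i - \sigma_i}{\sigma_i}\right) \le \exp\left(\frac{1}{w}\sum_{i \in Big}(\lambda_i - \sigma_i)\right) \le e^{1/w} \le e^{\epsilon_0/2},$$

and similarly, $\sqrt{\prod_i \frac{\sigma_i^2}{\lambda_i^2}} \le e^{\epsilon_0/2}$.

We turn to proving (5.5). We start with the following derivation.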


$$\begin{aligned}
x^T(A^TA)^{-1}x - x^T(A'^TA')^{-1}x &= x^T(A^TA)^{-1}(A'^TA')(A'^TA')^{-1}x - x^T(A'^TA')^{-1}x \\
&= x^T(A^TA)^{-1}\left((A + E)^T(A + E)\right)(A'^TA')^{-1}x - x^T(A'^TA')^{-1}x \\
&= x^T(A^TA)^{-1}(A^TE + E^TA')(A'^TA')^{-1}x
\end{aligned}$$

and using the SVD and denoting $E = e_iv^T$, we get

$$\begin{aligned}
x^T(A^TA)^{-1}x - x^T(A'^TA')^{-1}x &= x^T\left(V\Sigma^{-1}U^T\right)e_i \cdot v^T\left(V'\Lambda^{-2}V'^T\right)x \\
&\quad + x^T\left(V\Sigma^{-2}V^T\right)v \cdot e_i^T\left(U'\Lambda^{-1}V'^T\right)x
\end{aligned}$$

So now, assume $x$ is sampled from $A^TY$. (The case of $A'^TY$ is symmetric. In fact,
the names $A$ and $A'$ are interchangeable.) That is, assume we’ve sampled $y$ from $Y \sim
\mathcal{N}(0, I_{n \times n})$ and we have $x = A^Ty = V\Sigma U^Ty$ and equivalently $x = (A'^T - E^T)y =
V'\Lambda U'^Ty - ve_i^Ty$. The above calculation shows that

$$\left|x^T(A^TA)^{-1}x - x^T(A'^TA')^{-1}x\right| \le term_1 \cdot term_2 + term_3 \cdot term_4$$

where for $i = 1, 2, 3, 4$ we have $term_i = |vec_i \cdot y|$ and

$vec_1 = U\Sigma V^TV\Sigma^{-1}U^Te_i = e_i$, so $\|vec_1\| = 1$
$vec_2 = U'\Lambda^{-1}V'^Tv - e_iv^TV'\Lambda^{-2}V'^Tv$, so $\|vec_2\| \le \frac{1}{\lambda_d} + \frac{1}{\lambda_d^2}$
$vec_3 = U\Sigma^{-1}V^Tv$, so $\|vec_3\| \le \frac{1}{\sigma_d}$
$vec_4 = e_i - e_iv^TV'\Lambda^{-1}U'^Te_i$, so $\|vec_4\| \le 1 + \frac{1}{\lambda_d}$

Recall that all singular values, both of $A$ and $A'$, are greater than $w$ and that $vec_i \cdot
y \sim \mathcal{N}(0, \|vec_i\|^2)$, so w.p. $\ge 1 - \delta_0$ we have that for every $i$ it holds that $term_i \le
\sqrt{\ln(4/\delta_0)}\|vec_i\|$, so

$$\left|x^T(A^TA)^{-1}x - x^T(A'^TA')^{-1}x\right| \le 2\left(\frac{1}{w} + \frac{1}{w^2}\right)\ln(4/\delta_0) \le \frac{4\ln(4/\delta_0)}{w} \le \epsilon_0$$

this concludes the proof in our first stage.

Stage 2. We assume step 4 was applied, and denote $B = U\left(\sqrt{\Sigma^2 + w^2I}\right)V^T$ and $B' =
U'\left(\sqrt{\Lambda^2 + w^2I}\right)V'^T$. We denote the singular values of $B$ and $B'$ as $\sigma_1^B \ge \sigma_2^B \ge \ldots \ge \sigma_d^B$
and $\lambda_1^B \ge \lambda_2^B \ge \ldots \ge \lambda_d^B$ resp. Observe that by definition, for every $i$ we have $(\sigma_i^B)^2 =
\sigma_i^2 + w^2$ and $(\lambda_i^B)^2 = \lambda_i^2 + w^2$.

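Again, we assume we output $O^T = B^TY$, and compare $X = B^TY$ to $X' = B'^TY$. The
theorem merely requires Claim 5.13 to hold, which, in turn, depends on the following
two conditions.

$$e^{-\epsilon_0/2} \le \sqrt{\frac{\det(B'^TB')}{\det(B^TB)}} \le e^{\epsilon_0/2} \qquad (5.6)$$

$$\Pr_x\left[\frac{1}{2}\left|x^T\left((B^TB)^{-1} - (B'^TB')^{-1}\right)x\right| \ge \epsilon_0/2\right] \le \delta_0 \qquad (5.7)$$

The second stage deals with the problem that now, the gap $\Delta = B' - B$ is not necessarily
a rank-1 matrix. However, what we show is that all stages in the proof of Claim 5.13
either rely on the singular values or can be written as the sum of a few rank-1 matrix
multiplications.

The easier part is to claim that Eq. (5.6) holds. The analysis is a simple variation on
the proof of Eq. (5.4). Fact 5.14 still holds for the singular values of $A$ and $A'$. Observe
that $\lambda_i^B \ge \sigma_i^B$ iff $\lambda_i \ge \sigma_i$, and that in this case $\lambda_i^B - \sigma_i^B \le \lambda_i - \sigma_i$. And so, since $\sigma_i^B \ge w$, we have

$$\sqrt{\prod_i \frac{(\lambda_i^B)^2}{(\sigma_i^B)^2}} \le \prod_{i \in Big}\left(1 + \frac{\lambda_i^B - \sigma_i^B}{\sigma_i^B}\right) \le \exp\left(\frac{1}{w}\sum_{i \in Big}(\lambda_i - \sigma_i)\right) \le e^{1/w} \le e^{\epsilon_0/2}$$

and the remainder of the proof follows.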

We now turn to proving Eq. (5.7). We start with an observation regarding $A'^TA'$ and
$B'^TB'$.

$$\begin{aligned}
A'^TA' &= (A + E)^T(A + E) = A^TA + A'^TE + E^TA \\
B^TB &= V(\Sigma^2 + w^2I)V^T = V\Sigma^2V^T + w^2I = A^TA + w^2I \\
B'^TB' &= V'(\Lambda^2 + w^2I)V'^T = A'^TA' + w^2I \\
\Rightarrow B'^TB' - B^TB &= A'^TE + E^TA
\end{aligned}$$

Now we can follow the same outline as in the proof of (5.5). Fix $x$, then:

$$\begin{aligned}
x^T(B^TB)^{-1}x - x^T(B'^TB')^{-1}x &= x^T(B^TB)^{-1}(B'^TB')(B'^TB')^{-1}x - x^T(B'^TB')^{-1}x \\
&= x^T(B^TB)^{-1}\left[B^TB + A'^TE + E^TA\right](B'^TB')^{-1}x - x^T(B'^TB')^{-1}x \\
&= x^T(B^TB)^{-1}\left[A'^TE + E^TA\right](B'^TB')^{-1}x \\
&= x^T(B^TB)^{-1}(A^T + E^T)e_i \cdot v^T(B'^TB')^{-1}x \\
&\quad + x^T(B^TB)^{-1}v \cdot e_i^T(A' - E)(B'^TB')^{-1}x
\end{aligned}$$

It is straightforward to see that the $i$-th spectral value of $(B^TB)^{-1}A^T$ is $\frac{\sigma_i}{\sigma_i^2 + w^2} \le
1/w$, and similarly for the spectral values of $(B'^TB')^{-1}A'^T$. We now proceed as before and
partition the above sum into multiplications of pairs of terms where $term_i \le |vec_i^Ty|$, and
$y$ is sampled from $\mathcal{N}(0, I_{n \times n})$ and $x = B^Ty$:

$$\begin{aligned}
x^T(B^TB)^{-1}x - x^T(B'^TB')^{-1}x &= y^T\left[B(B^TB)^{-1}(A^T + E^T)e_i\right] \cdot \left[v^T(B'^TB')^{-1}B^T\right]y \\
&\quad + y^T\left[B(B^TB)^{-1}v\right] \cdot \left[e_i^T(A' - E)(B'^TB')^{-1}B^T\right]y
\end{aligned}$$

Lastly, we need to bound all terms that contain the multiplication $(B'^TB')^{-1}B^Ty$ in
comparison to $(B'^TB')^{-1}B'^Ty = B'^{\dagger}y$. For instance, take the term $|vec^Ty|$ for $vec^T =
e_i^T(A' - E)(B'^TB')^{-1}B^T$, and define it as $vec^T = z^TB^T$. We can only bound $\|Bz\|$ using
$\sigma_1^B/(\lambda_d^B)^2$, whereas we can bound $\|B'z\|$ with $1/\lambda_d^B \le 1/w$. In contrast to before, we do
not use the fact that $B^Ty = (B' - \Delta)^Ty$. Instead, we make the following derivations.

First, we observe that for every vector $z$ we have that $\|B'z\| \ge \|A'z\|$ and $\|B'z\| \ge
w\|z\|$. Using the fact that $B^TB - B'^TB' = -A'^TE - E^TA$, a simple derivation gives that
$\|Bz\|^2 \le (\|B'z\| + \|z\|)^2 \le \left(1 + \frac{1}{w}\right)^2\|B'z\|^2$, and vice-versa. So if $y$ is s.t.
$\left(\left(1 + \frac{1}{w}\right)\|B'z\|\right)^{-1}|z^TB^Ty| > Threshold$ then $\left(\|Bz\|\right)^{-1}|z^TB^Ty| > Threshold$. Observe that $z^TB^Ty$ is distributed like $\mathcal{N}(0, \|Bz\|^2) =
\|Bz\|\mathcal{N}(0, 1)$, and so we have that for every $\delta' > 0$

$$\begin{aligned}
\Pr\left[|z^TB^Ty| > \sqrt{\log(1/\delta')}\left(1 + \frac{1}{w}\right)\|B'z\|\right] &= \Pr\left[\left(\left(1 + \frac{1}{w}\right)\|B'z\|\right)^{-1}|z^TB^Ty| > \sqrt{\log(1/\delta')}\right] \\
&\le \Pr\left[\left(\|Bz\|\right)^{-1}|z^TB^Ty| > \sqrt{\log(1/\delta')}\right] \\
&\le \delta'
\end{aligned}$$

□


Corollary. Using the definitions of $r$ and $w$ as in Algorithm 4, the proof of Theorem 5.11
actually shows that in the case that $A$ is a matrix with all singular values $\ge w$,
the following simple algorithm preserves $(\epsilon, \delta)$-differential privacy: pick a random
$r \times n$ matrix $M$ whose entries are iid normal Gaussians, and output $O = MA$. Furthermore,
observe that if $\sigma_d$, the least singular value of $A$, is bigger than, say, $10w$, then one
can release $\sigma_d + \mathrm{Lap}(1/\epsilon)$ and then release $O = MA$. In such a case, users know that for any
unit vector $x$, w.p. $\ge 1 - \nu$ it holds that $\frac{1}{r}\|Ox\|^2 \in (1 \pm \eta)\|Ax\|^2$.


Comment. Comparing Algorithms 2 and 4, we have that in $L_G = E_G^TE_G$ we “translate”
the spectral values by $w$, and in $A^TA$ we “translate” the spectral values by $w^2$. This is
an artifact of the ability to directly compare the spectral values of $L_G$ and $L_{G'}$ in the first
analysis, whereas in the second analysis we compare the spectral values of $A$ and $A'$ (vs.
$A^TA$ and $A'^TA'$). This is why the noise bounds in the general case are $O(1/(\epsilon\eta))$ times
worse than for graphs.


5.4.2  Comparison  with  Other  Algorithms 

To  the  best  of  our  knowledge,  no  previous  work  has  studied  the  problem  of  preserving 
the  variance  of  A  in  the  same  formulation  as  us.  We  deal  with  a  scenario  where  users 
pose  the  directions  on  which  they  wish  to  find  the  variance  of  A.  Other  algorithms,  that 
publish  the  PCA  or  a  low-rank  approximation  of  A  without  compromising  privacy  (see 
Section  6.1.1),  provide  users  with  specific  directions  and  variances.  These  works  are  not 
comparable  with  our  algorithm,  as  they  give  a  different  utility  guarantee.  For  example, 
low-rank  approximations  aim  at  nullifying  the  projection  of  A  in  certain  directions. 

Here,  we  compare  our  method  to  the  Laplace  mechanism,  the  Multiplicative  Weights 
mechanism  and  Randomized  Response.  The  bottom  line  is  clear:  our  method  allows  one  to 
answer  directional  variance  queries  with  additive  noise  which  is  independent  of  the  given 
input.  Other  methods  require  we  add  random  noise  that  depends  on  the  size  of  the  matrix, 
assuming  we  answer  polynomially  many  queries. 

Our notation is as follows: $n$ denotes the number of rows in the matrix (number of individuals
in the data), $d$ denotes the number of columns in the matrix, and we assume each
entry is at most 1. As before, $\epsilon$ denotes the privacy parameter and $k$ denotes the number
of queries. Observe that we (again) compare $k$ predetermined queries for non-interactive
mechanisms with $k$ adaptively chosen queries for interactive ones. The remaining parameters
are omitted from this comparison. Results are summarized in Table 5.2.

5.4.2.1 Naively Adding Laplace Noise

Again, the simplest alternative is to answer each directional-variance query with $\Phi_A(x) +
\mathrm{Lap}(0, \epsilon'^{-1})$ for a suitable value of $\epsilon'$. The composition theorem of [65] assures us that
to answer $k$ queries while overall preserving $(\epsilon, \delta)$-differential privacy, we must set $\epsilon' \le
\epsilon/\sqrt{k}$, and so the additive error per query is $O(\sqrt{k}/\epsilon)$.

5.4.2.2 Randomized Response

We now consider a Randomized Response mechanism, similar to the Randomized Response
mechanism of [77]. We wish to output a noisy version of $A^TA$, by adding some iid
random noise to each entry of $A^TA$. Since we call two matrices neighbors if they differ
only on a single row, denote $v$ as the difference vector on that row. It is simple to see that
by adding $v$ to some row in $A$, each entry in $A^TA$ can change by at most $\|v\|_1$. Recall that
we require $\|v\| = 1$ and so $\|v\|_1 \le \sqrt{d}$. Therefore, we have that in order to preserve
$(\epsilon, \delta)$-differential privacy, it is enough to add random Gaussian noise of $\mathcal{N}\left(0, \frac{d\log(d)}{\epsilon^2}\right)$ to
each of the $d^2$ entries of $A^TA$.

Next we give the utility guarantee of the Randomized Response scheme. Fix any unit
length vector $x$. We think of the matrix we output as $A^TA + N$, where $N$ is a matrix of iid
samples from $\mathcal{N}\left(0, \frac{d\log(d)}{\epsilon^2}\right)$. Therefore, in direction $x$, we add to the true answer a random
noise distributed like $x^TNx \sim \mathcal{N}\left(0, \sum_{i,j} x_i^2x_j^2\frac{d\log(d)}{\epsilon^2}\right) = \mathcal{N}\left(0, \frac{d\log(d)}{\epsilon^2}\right)$. So w.h.p. the
noise we add is within a factor of $\tilde{O}(\sqrt{d}/\epsilon)$ for each query, and for $k$ queries it is within a
factor of $O(\sqrt{d\log(k)}/\epsilon)$.
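
The key step in the utility argument is that $x^TNx$ has the same variance as a single entry of $N$ whenever $x$ is a unit vector, because $\sum_{i,j} x_i^2x_j^2 = \|x\|^4 = 1$. The following Python sketch checks this empirically; the per-entry standard deviation `sigma` is a free parameter here, and the function names are hypothetical.

```python
import math
import random


def quadratic_form_noise(x, sigma, rng):
    """Return x^T N x, where N is a d x d matrix of iid N(0, sigma^2) entries."""
    d = len(x)
    return sum(x[i] * x[j] * rng.gauss(0.0, sigma)
               for i in range(d) for j in range(d))


def empirical_variance(samples):
    """Plain (biased) sample variance."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


rng = random.Random(0)
d = 5
x = [1 / math.sqrt(d)] * d                      # a unit-length direction
draws = [quadratic_form_noise(x, 3.0, rng) for _ in range(20000)]
print(empirical_variance(draws))                # close to sigma^2 = 9
```

The per-query noise therefore grows with $d$ only through the per-entry variance needed for privacy, which is where the $\sqrt{d}$ factor in the bound comes from.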

5.4.2.3  The  Multiplicative  Weights  Mechanism 

It is not straightforward to adapt the Multiplicative Weights mechanism to answer directional
variance queries. We represent $A^TA$ as a histogram over its $d^2$ entries (so the size of
the “universe” is $N = d^2$), but it is not simple to estimate what is the equivalent of the number
of individuals in this representation. We chose to take the pessimistic bound of $nd^2$,
since this is the $L_1$ bound on the sum of entries in $A^TA$, but we comment this is a highly
pessimistic bound. It is fairly likely that the number of individuals in this representation
can be set to only $O(d^2)$.

Plugging these parameters into the utility bounds of the Multiplicative Weights mechanism,
we get a utility bound of $O(d\sqrt{n}\log(k)/\epsilon)$. Plugging them into the improved
bounds of the IDC mechanism, we get $\tilde{O}(d\sqrt{n\log(k)/\epsilon})$. Observe that even if we replace the
pessimistic bound of $nd^2$ with just $d^2$, these bounds depend on $d$.

Method | Additive error | Multiplicative error? | Interactive? | Tractable?
Laplace Noise | $O(\sqrt{k}/\epsilon)$ | ✗ | ✓ | ✓
Randomized Response | $O(\sqrt{d\log(k)}/\epsilon)$ | ✗ | ✗ | ✓
MW | $O(d\sqrt{n}\log(k)/\epsilon)$ | ✗ | ✓ | ✓
IDC | $\tilde{O}(d\sqrt{n\log(k)/\epsilon})$ | ✗ | ✓ | ✓
JL | $O(\log(k)/\epsilon^2)$ | ✓ | ✗ | ✓

Table 5.2: Comparison between mechanisms for answering directional variance queries.


5.4.2.4 Our Algorithm

Our algorithm’s utility is computed simply by plugging $\nu = O(1/k)$ into Theorem 5.12,
which gives a utility bound of $O(\log(k)/\epsilon^2)$.


5.5  Discussion  and  Open  Problems 

The  fact  that  the  JL  transform  preserves  differential  privacy  is  likely  to  have  more  theoret¬ 
ical  and  practical  applications  than  the  ones  detailed  in  this  chapter.  Below  we  detail  a  few 
of  the  open  questions  we  find  most  compelling. 


Error  depedency  on  r.  Our  algorithm  projects  the  edge-matrix  of  a  given  graph  on  r 
random  directions,  then  publishes  these  projections.  The  value  of  r  determines  the  prob¬ 
ability  we  give  a  good  approximation  to  a  given  cut-query,  and  provided  that  we  wish  to 
give  a  good  approximation  to  all  cut-queries,  our  analysis  requires  us  to  set  r  =  Q(n).  But 
is  it  just  an  artifact  of  the  analysis?  Could  it  be  that  a  better  analysis  gives  a  better  bound 
on  r?  It  turns  out  that  the  answer  is  “no”.  In  fact,  the  direction  on  which  we  project  the 
data  now  have  high  correlation  with  the  published  Laplacian.  We  demonstrate  this  with  an 
example. 

Assume  our  graph  is  composed  of  a  single  perfect  matching  between  2 n  nodes,  where 
node  i  is  matched  with  node  n  +  i.  Focus  on  a  single  random  projection  -  it  is  chosen  by 
picking  (22n)  iid  random  values  ~  J\f( 0, 1),  and  for  the  ease  of  exposition  imagine  that 
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the  values  of  the  edges  in  the  matching  are  picked  first,  then  the  values  of  all  other  pairs 
of  vertices.  Now,  if  we  pick  the  value  xitn+i  for  the  (i,  n  +  i )  edge,  then  node  i  is  assigned 
xitn+i  while  node  n  +  i  is  assigned  —  x^n+i.  So  regardless  of  the  sign  of  xitn+i,  exactly 
one  of  the  two  nodes  {i,  n  +  i}  is  assigned  the  positive  value  \xitn+i\  and  exactly  one  is 
assigned  the  negative  value  —  \xitn+i\.  Define  S  as  the  set  of  n  nodes  that  are  assigned  the 
positive  values  and  S  as  the  set  of  n  nodes  that  are  assigned  the  negative  values.  The  sum 
of  weight  crossing  the  ( S ,  5) -cut  is  distributed  like  ( X  +  - Y )2  where  X  =  Yi  \%i,n+i\  and 
Y  =  Y%  Yj^n+i  xi,j •  Indeed,  Y  is  the  sum  of  n(n  —  1)  random  normal  iid  Gaussians,  but 
X  is  the  sum  of  n  absolute  values  of  Gaussians.  So  w.h.p.  both  X  and  Y  are  proportional 
to  n.  Therefore,  in  the  direction  of  this  particular  random  projection  we  estimate  the 
(S,  5) -cut  as  Q((n  ±  w )2)  =  D(n2)  rather  than  0(n).  (If  X  was  distributed  like  the  sum 
of  n  iid  normal  Gaussians,  then  the  estimation  would  be  proportional  to  (\/Ti)2  =  n.) 

Assuming that the remaining $r-1$ projections estimate the cut as $O(n)$, then by averaging
over all $r$ random projections our estimation of the $(S, \bar{S})$-cut is $\omega(n)$, as long as
$r = o(n)$.
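
The gap between a sum of absolute values of Gaussians and a plain sum of Gaussians can be checked numerically. The following is a minimal sketch, assuming nothing beyond the standard library; the variable names and the choice $n = 10{,}000$ are ours and purely illustrative.

```python
import math
import random

rng = random.Random(0)
n = 10_000

gaussians = [rng.gauss(0.0, 1.0) for _ in range(n)]
sum_abs = sum(abs(g) for g in gaussians)   # plays the role of X: grows linearly in n
sum_signed = sum(gaussians)                # a plain Gaussian sum: grows like sqrt(n)

ratio_abs = sum_abs / n                    # close to E|N(0,1)| = sqrt(2/pi)
ratio_signed = abs(sum_signed) / n         # close to 0
expected_abs = math.sqrt(2.0 / math.pi)
```

Squaring a quantity of order $n$ is what produces the $\Omega(n^2)$ estimate in the example, whereas squaring a quantity of order $\sqrt{n}$ would give only $O(n)$.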


Error amplification or error detection. Having established that we do err on some
cuts, we pose the question of error amplification. Can we introduce some error-correction
scheme to the problem without increasing $r$ significantly? Error amplification without
increasing $r$ will allow us to keep the additive error fairly small. One can view $L$ as a
coding of answers to all $2^n$ cut-queries which is guaranteed to have at least a $1-\nu$ fraction
of the code correct, in the sense that we get an $(\eta, \tau)$-approximation to the true cut-query
answer. As such, it is tempting to try some self-correcting scheme, like adding a random
vector $x$ to the vector $\mathbf{1}_S$, then finding the estimation to $x^T L_G x$ and $(\mathbf{1}_S + x)^T L_G x$ and
inferring the answer to the cut-query on $S$. We were unable to prove that such a scheme works
due to the dot-product problem (see next paragraph) and to query dependencies.

A related question is of error detection: can we tell whether $L$ gives a good estimation
to a cut query or not? One potential avenue is to utilize the trivial guess for $\Phi_G(S)$ - the
expected value $\frac{m}{\binom{n}{2}} s(n-s)$ (we can release $m$ via the Laplace mechanism). We believe
this question is related to the problem of estimating the variance of $\{\Phi_G(S) : |S| = s\}$.


Edges between $S$ and $T$. Our work assures utility only for cut-queries. It gives no utility
guarantees for queries regarding $E(S,T)$, the set of edges connecting two disjoint
vertex-subsets $S$ and $T$. The reason is that it is possible to devise a graph where both $E(S, \bar{S})$ and
$E(T, \bar{T})$ are large whereas $E(S,T)$ is fairly small. When $E(S, \bar{S})$ and $E(T, \bar{T})$ are big, the
multiplicative error $\eta$ given to both quantities might add too much noise to an estimation
of $E(S,T)$.

The problem relates to the dot-product estimation of the JL transform. It is a classical
result that if $M$ is a distance-preserving matrix and $u$ and $v$ are two vectors s.t.
$\|M(u+v)\|^2 \approx \|u+v\|^2$ and $\|M(u-v)\|^2 \approx \|u-v\|^2$ then it is possible to bound
the difference $|Mu \cdot Mv - u \cdot v|$. But this bound is a function of $\|u\|$ and $\|v\|$, which
in our case translates to a bound that depends on $\|E_G \mathbf{1}_S\|$ and $\|E_G \mathbf{1}_T\|$, both vectors of
potentially large norms.
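
To make the dependence on the norms concrete, here is the standard polarization-identity calculation. It is a sketch under the assumption (ours, for illustration) that the approximation is multiplicative with parameter $\eta$, i.e. $\big|\|Mz\|^2 - \|z\|^2\big| \le \eta \|z\|^2$ for $z \in \{u+v, u-v\}$.

```latex
\begin{align*}
Mu \cdot Mv - u \cdot v
  &= \tfrac{1}{4}\Big(\|M(u+v)\|^2 - \|u+v\|^2\Big)
   - \tfrac{1}{4}\Big(\|M(u-v)\|^2 - \|u-v\|^2\Big), \\
|Mu \cdot Mv - u \cdot v|
  &\le \tfrac{\eta}{4}\Big(\|u+v\|^2 + \|u-v\|^2\Big)
   = \tfrac{\eta}{2}\Big(\|u\|^2 + \|v\|^2\Big).
\end{align*}
```

The last equality is the parallelogram law. The error term is therefore governed by the squared norms of the two vectors themselves, not by the size of their dot-product, which is why large $\|E_G \mathbf{1}_S\|$ and $\|E_G \mathbf{1}_T\|$ can swamp a small $E(S,T)$.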


Other Versions of JL. The analysis in this work deals with the most basic JL transform,
using normal Gaussians. We conjecture that qualitatively the same results should
apply for most of the other versions of the JL transform that utilize a dense transform
(e.g., using unit-length projections or uniformly chosen entries in $[-1, 1]$). Notice however
that some versions of the JL transform clearly do not preserve privacy. For example, the version
of Achlioptas [3], where each entry is chosen uniformly to be $1$ or $-1$, does not
preserve privacy: it is easy to differentiate between the complete graph $K_n$ and its neighbor
based on observing whether a specific coordinate is even or odd. It is an interesting
open problem to see whether sparse JL transforms (see [55]) preserve differential privacy
or not.

Chapter 6

Differentially Private Data Analysis of Social Networks via Restricted Sensitivity

6.1  Introduction 


The social networks we inhabit have grown significantly in recent decades, with digital
technology enabling the rise of networks like Facebook that now connect over 900 million
people and house vast repositories of personal information. At the same time, the study of
various characteristics of social networks has emerged as an active research area [66]. Yet
the fact that the data in a social network might be used to infer sensitive details about an
individual, like sexual orientation [92], is a growing concern among social networks’
participants. Even in an ‘anonymized’ unlabeled graph it is possible to identify people based
on graph structures [23]. Here we study the feasibility of, and design efficient algorithms
for, releasing statistics about social networks (modeled as graphs with vertices labeled with
attributes) while satisfying the semantic definition of differential privacy [63, 62].

A differentially private mechanism guarantees that any two neighboring data sets (i.e.,
data sets that differ only on the information about a single individual) induce similar
distributions over the statistics released. For social networks, which are graphs with node
coloring, we consider two notions of neighboring or adjacent networks: (1) edge adjacency,
stipulating that adjacent graphs differ in just one edge or in the attributes of just
one vertex; and (2) vertex adjacency, stipulating that adjacent networks differ on just one
vertex: its attributes or any number of edges incident to it. We comment that the edge
adjacency model considered here is slightly different from the edge-adjacency model
considered in Chapter 5, as we also allow a node to change its attributes.

For  any  given  statistic  or  query,  its  global  sensitivity  measures  the  maximum  difference 
in  the  answer  to  that  query  over  all  pairs  of  neighboring  data  sets  [63];  global  sensitivity 
provides  an  upper  bound  on  the  amount  of  noise  that  has  to  be  added  to  the  actual  statistic 
in  order  to  preserve  differential  privacy.  Since  the  global  sensitivity  of  certain  types  of 
queries  can  be  quite  high,  the  notion  of  smooth  sensitivity  was  introduced  to  reduce  the 
amount  of  noise  that  needs  to  be  added  while  still  preserving  differential  privacy  [119]. 

However, a key challenge in the differentially private analysis of social networks is
that for many natural queries, both global and smooth sensitivity can be very large. In
the vertex adjacency model, consider the query “How many people in $G_1$ are doctors
or are friends with a doctor?” Even if the answer is $0$ (e.g., there are no doctors in the
social network) there is a neighboring social network $G_2$ in which the answer is $n$ (e.g.,
pick an arbitrary person from $G_1$, relabel him as a doctor, and connect him to everyone).
Even in the edge adjacency model, the sensitivity of queries may be high. Consider the
query “How many people in $G_1$ are friends with two doctors who are also friends with
each other?” In $G_1$ the answer may be $0$ even if there are two doctors that everyone else
is friends with (e.g., the doctors are not friends with each other), but the answer jumps to
$n-2$ in a neighboring graph $G_2$ (e.g., if we simply connect the doctors to each other).
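
Both jumps can be reproduced on a small instance. The sketch below is illustrative only: the graph encoding (a set of vertex pairs plus a list of boolean doctor labels), the function names, and the choice $n = 8$ are our assumptions, not part of the text above.

```python
def doctor_or_friend_of_doctor(edges, is_doctor):
    """Count people who are doctors or have at least one doctor friend."""
    near = {v for v, d in enumerate(is_doctor) if d}
    for u, v in edges:
        if is_doctor[u]:
            near.add(v)
        if is_doctor[v]:
            near.add(u)
    return len(near)


def friends_with_two_befriended_doctors(edges, is_doctor):
    """Count people adjacent to two doctors who are themselves adjacent."""
    n = len(is_doctor)
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    count = 0
    for p in range(n):
        docs = [d for d in adj[p] if is_doctor[d]]
        if any(b in adj[a] for i, a in enumerate(docs) for b in docs[i + 1:]):
            count += 1
    return count


n = 8

# Vertex adjacency: G1 has no doctors; G2 relabels node 0 and connects it to everyone.
g1 = (set(), [False] * n)
g2 = ({(0, v) for v in range(1, n)}, [True] + [False] * (n - 1))
vertex_jump = doctor_or_friend_of_doctor(*g2) - doctor_or_friend_of_doctor(*g1)

# Edge adjacency: doctors 0 and 1 know everyone else; G2 only adds the edge {0, 1}.
labels = [True, True] + [False] * (n - 2)
h1 = {(d, v) for d in (0, 1) for v in range(2, n)}
h2 = h1 | {(0, 1)}
edge_jump = (friends_with_two_befriended_doctors(h2, labels)
             - friends_with_two_befriended_doctors(h1, labels))
```

Here `vertex_jump` equals $n$ and `edge_jump` equals $n-2$, matching the two examples.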

In fact, such queries have high global sensitivity in the edge-adjacency model we consider
here, because this model allows for a node to change its attributes arbitrarily. For
example, the global sensitivity of the query “how many people are friends with a doctor”
is $\Omega(n)$, as illustrated by the two adjacent star graphs (one node with degree $n-1$ and
all other nodes with degree $1$) where in one no node is a doctor and in the other the central
node has an M.D. Therein lies the difference between the problem considered in Chapter 5
and this chapter. In Chapter 5 we consider only the graph structure, or alternatively a social
network where the labeling of each node is public knowledge, and so cut-queries have
global sensitivity of $1$. Here we consider possible changes both to the graph structure and
to the labeling of the nodes, so cut-queries may have a sensitivity of $\Omega(n)$. For example,
consider the query “do Chinese people have many non-Chinese friends on average?” If the
labeling of each node is public we may use any of the techniques detailed in Chapter 5 to
answer such a query. However, if the labeling of each node is private (we want each person
to be able to pretend she is Chinese), the only existing differential privacy technique that
might answer this query with $o(n)$ error is the framework of smooth sensitivity [119].

Yet, while the examples obtaining global sensitivity of $n$ respect the mathematical
definitions of neighboring graphs and networks, we note that in a real social network
no single individual is likely to be directly connected with everyone else. Suppose that
in fact a querier has some such belief $\mathcal{H}$ about the given network ($\mathcal{H}$ is a subset of all
possible networks) such that its query $f$ has low sensitivity restricted only to inputs and
deviations within $\mathcal{H}$. For example, the querier may believe the following hypothesis ($\mathcal{H}_k$):
the maximum degree of any node in the network is at most $k = 5000 \ll n \approx 9 \times 10^8$ (e.g.,
after reading a study on the anatomy of Facebook [134]). Can one in that case provide
accurate answers in the event that indeed $G \in \mathcal{H}$ and yet preserve privacy no matter what
(even if $\mathcal{H}$ is not satisfied)?

Here we provide a positive answer to this question. We do so by introducing the notion
of restricted sensitivity, which represents the sensitivity of the query $f$ over only the given
subset $\mathcal{H}$, and providing procedures that map a query $f$ to an alternative query $f_{\mathcal{H}}$ s.t. $f$
and $f_{\mathcal{H}}$ are identical over the inputs in $\mathcal{H}$, yet the global sensitivity of $f_{\mathcal{H}}$ is comparable to just
the restricted sensitivity of $f$. Therefore, the mechanism that answers according to $f_{\mathcal{H}}$ and
adds Laplace random noise preserves privacy for all inputs, while giving good estimations
of $f$ for inputs in $\mathcal{H}$.

While our general scheme for devising such $f_{\mathcal{H}}$ is inefficient and requires that we
construct a separate $f_{\mathcal{H}}$ for each query $f$, we also design a complementary projection-based
approach. A projection of $\mathcal{H}$ is a function mapping all possible inputs (e.g., all possible
$n$-node social networks) to inputs in $\mathcal{H}$ with the property that any input in $\mathcal{H}$ is mapped
to itself. Therefore, a projection $\mu$ allows us to define $f_{\mathcal{H}}$ for any $f$, simply by composing
$f_{\mathcal{H}} = f \circ \mu$. Moreover, if this projection $\mu$ satisfies certain smoothness properties, which
we define in Section 6.4, then this function $f_{\mathcal{H}}$ will have its global sensitivity, or at least
its smooth sensitivity over inputs in $\mathcal{H}$, comparable to only the restricted sensitivity of $f$.
In particular, for the case $\mathcal{H} = \mathcal{H}_k$ (the assumption that the network has degree at most
$k \ll n$), we show we can efficiently construct projections $\mu$ satisfying these conditions,
therefore allowing us to efficiently take advantage of low restricted sensitivity. These
results are given in Section 6.4 and summarized in Table 6.1.

The next natural question is: how much advantage does restricted sensitivity provide,
compared to global or smooth sensitivity, for natural query classes and natural sets $\mathcal{H}$?
In Section 6.5 we consider two natural classes of queries: local profile queries and
subgraph counting queries. A local profile query asks how many nodes $v$ in a graph satisfy a
property which depends only on the immediate neighborhood of $v$ (e.g., queries relating to
clustering coefficients and bridges [66], or queries such as “how many people know two
spies who don’t know each other?”). A subgraph counting query asks how many copies of
a particular subgraph $P$ are contained in the network (e.g., number of triangles involving
at least one spy). For the case $\mathcal{H} = \mathcal{H}_k$ for $k \ll n$ we show that the restricted sensitivity
of these classes of queries can indeed be much lower than the smooth sensitivity. These
results, presented in Section 6.5, are summarized in Table 6.2.

               Adjacency   Hypothesis        Query   Sensitivity                                                   Efficient
Theorem 6.5    Any         Any               Any     $GS_{f_{\mathcal{H}}} = RS_f(\mathcal{H})$                    No
Theorem 6.10   Edge        $\mathcal{H}_k$   Any     $GS_{f_{\mathcal{H}}} = 3 \cdot RS_f(\mathcal{H})$            Yes
Theorem 6.16   Vertex      $\mathcal{H}_k$   Any     $S_{f_{\mathcal{H}}} = O(1) \times RS_f(\mathcal{H}_{2k})$    Yes

Table 6.1: Summary of Results. GS = global sensitivity, RS = restricted sensitivity, and
S = smooth bound of local sensitivity.


             Subgraph Counting Query $P$             Local Profile Query
Adjacency    Smooth              Restricted          Smooth    Restricted
Edge         $|P| k^{|P|-1}$     $|P|$               $k+1$     $k+1$
Vertex       $O(n^{|P|-1})$      $|P| k^{|P|-1}$     $n-1$     $2k+1$

Table 6.2: Worst Case Smooth Sensitivity over $\mathcal{H}_k$ vs. Restricted Sensitivity $RS_f(\mathcal{H}_k)$.

6.1.1  Related  Work 

Easley and Kleinberg provide an excellent summary of the rich literature on social
networks [66]. Previous literature on differentially private analysis of social networks has
primarily focused on the edge adjacency model in unlabeled graphs, where sensitivity is
manageable. Triangle counting queries can be answered in the edge adjacency model by
efficiently computing the smooth sensitivity [119], and this result can be extended to
answer other counting queries [100]. Hay et al. [84] show how to privately approximate the
degree distribution in the edge adjacency model. The Johnson-Lindenstrauss transform
can be used to answer all cut queries in the edge adjacency model [41]. Kasiviswanathan,
Nissim, Raskhodnikova and Smith [102] have independently explored and developed
an analysis for node-level privacy using an approach similar to ours.

The approach taken in the work of Rastogi et al. [125] on answering subgraph counting
queries is the most similar to ours. They consider a Bayesian adversary whose prior
(background knowledge) is drawn from a distribution. Leveraging an assumption about
the adversary’s prior, they compute a high-probability upper bound on the local sensitivity
of the data and then answer by adding noise proportional to that bound. Loosely, they
assume that the presence of an edge does not make the presence of other edges more likely.
In the specific context of a social network this assumption is widely believed to be false (e.g.,
two people are more likely to become friends if they already have common friends [66]).
The privacy guarantees of [125] only hold if these assumptions about the adversary’s prior
are true. By contrast, we always guarantee privacy even if the assumptions are incorrect.

A relevant approach that deals with preserving differential privacy while providing
better utility guarantees for nice instances is detailed in the work of Nissim et al. [119],
who define the notion of smooth sensitivity. In their framework, the amount of random
noise that the mechanism adds to a query’s true answer depends on the extent to
which the input database is “nice”, i.e., has small local sensitivity. As we discuss later, in
social networks many natural queries (e.g., local profile queries) even have high local and
smooth sensitivity.


6.2  Preliminaries  -  Graphs  and  Social  Networks 

Our  work  is  motivated  by  the  challenges  posed  by  differentially  private  analysis  of  social 
networks.  As  always,  a  graph  is  a  pair  of  a  set  of  vertices  and  a  set  of  edges  G  =  (V,  E). 
We  often  just  denote  a  graph  as  G,  referring  to  its  vertex-set  or  edge-set  as  V (G)  or  E(G) 
resp.  A  key  aspect  of  our  work  is  modeling  a  social  network  as  a  labeled  graph. 

Definition 6.1. A social network $(G, \ell)$ is a graph with labeling function $\ell : V(G) \to \mathbb{R}^m$.
The set of all social networks is denoted $\mathcal{G}$.

The labeling function $\ell$ allows us to encode information about the nodes (e.g., age,
gender, occupation). For convenience, we assume all social networks are over the same
set of vertices, which is denoted $V$ and has size $n$, and so we assume $|V| = n$ is public
knowledge.$^1$ Therefore, the graph structures of two social networks are equal if their
edge-sets are identical. Similarly, we also fix the dimension $m$ of our labeling.

Defining differential privacy over the labeled graphs $\mathcal{G}$ requires care. What does it
mean for two labeled graphs $G_1, G_2 \in \mathcal{G}$ to be neighbors? There are two natural notions:
edge-adjacency and vertex-adjacency.

Definition 6.2 (Edge-adjacency). We say that two social networks $(G_1, \ell_1)$ and $(G_2, \ell_2)$
are neighbors if either (i) $E(G_1) = E(G_2)$ and there exists a vertex $u$ such that $\ell_1(u) \neq \ell_2(u)$
whereas for every other $v \neq u$ we have $\ell_1(v) = \ell_2(v)$, or (ii) $\forall v,\ \ell_1(v) = \ell_2(v)$ and
the symmetric difference $E(G_1) \,\triangle\, E(G_2)$ contains a single edge.

In the context of a social network, differential privacy w.r.t. edge-adjacency can, for
instance, guarantee that an adversary will not be able to distinguish whether a particular
individual has friended some specific pop-singer on Facebook. However, such guarantees
do not allow a person to pretend to listen only to high-end indie rock bands, should that
person have friended numerous pop-singers on Facebook. This motivates the stronger
vertex-adjacency neighborhood model.

1  Adding  or  removing  vertices  could  be  done  by  adding  one  more  dimension  to  the  labeling,  indicating 
whether  a  node  is  active  or  inactive. 

Definition 6.3 (Vertex-adjacency). We say that two social networks $(G_1, \ell_1)$ and $(G_2, \ell_2)$
are neighbors if there exists a vertex $v_i$ such that $G_1 - v_i = G_2 - v_i$ and $\ell_1(v_j) = \ell_2(v_j)$
for every $v_j \neq v_i$, where for a graph $G$ and a vertex $v$ we denote $G - v$ as the result of
removing every edge in $E(G)$ that touches $v$.

It is evident that any two social networks that are edge-adjacent are also vertex-adjacent.
Preserving differential privacy while guaranteeing good utility bounds w.r.t. vertex-adjacency
is a much harder task than w.r.t. edge-adjacency.
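
The two adjacency notions can be phrased as simple predicates. The following is a minimal sketch, assuming a network is encoded as a pair (frozenset of edges, dict of labels) over a fixed vertex set; this encoding, the function names, and the choice to treat identical networks as non-neighbors are our assumptions.

```python
def edge_adjacent(net1, net2):
    """Definition 6.2: exactly one label differs, or exactly one edge differs."""
    (e1, l1), (e2, l2) = net1, net2
    changed_labels = [v for v in l1 if l1[v] != l2[v]]
    if e1 == e2:
        return len(changed_labels) == 1
    return not changed_labels and len(e1 ^ e2) == 1


def vertex_adjacent(net1, net2):
    """Definition 6.3: the networks agree after ignoring one vertex entirely."""
    (e1, l1), (e2, l2) = net1, net2
    if net1 == net2:
        return False  # assumption: neighbors are distinct networks
    for v in l1:
        same_labels = all(l1[u] == l2[u] for u in l1 if u != v)
        same_rest = ({e for e in e1 if v not in e}
                     == {e for e in e2 if v not in e})
        if same_labels and same_rest:
            return True
    return False


labels = {0: "a", 1: "a", 2: "a", 3: "a"}
empty = (frozenset(), labels)
one_edge = (frozenset({(0, 1)}), labels)
rewired = (frozenset({(0, 1), (0, 2)}), {0: "doctor", 1: "a", 2: "a", 3: "a"})
matching = (frozenset({(0, 1), (2, 3)}), labels)
```

In this example `empty` and `one_edge` are neighbors under both notions, `empty` and `rewired` are neighbors only under vertex-adjacency (everything that changed involves vertex 0), and `empty` and `matching` are neighbors under neither.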


Distance. Given two social networks $(G_1, \ell_1)$ and $(G_2, \ell_2)$, recall that their distance is
the minimal $k$ s.t. one can form a path of length $k$, starting with $(G_1, \ell_1)$ and ending
at $(G_2, \ell_2)$, with the property that every two consecutive social networks on this path are
adjacent. Given the above two definitions of adjacency, we would like to give an alternative
characterization of this distance.

First of all, the set $U = \{v : \ell_1(v) \neq \ell_2(v)\}$ dictates $|U|$ steps that we must take
in order to transition from $(G_1, \ell_1)$ to $(G_2, \ell_2)$. It is left to determine how many adjacent
social networks we need to transition through until we have $E(G_1) = E(G_2)$. To that end,
we construct the difference graph, whose edges are the symmetric difference of $E(G_1)$ and
$E(G_2)$. Clearly, to transition from $(G_1, \ell_1)$ to $(G_2, \ell_2)$, we need to alter every edge in the
difference graph. In the edge-adjacency model, a pair of adjacent social networks covers
precisely a single edge, and so it is clear that the distance $d((G_1, \ell_1), (G_2, \ell_2)) = |U| +
|E(G_1) \,\triangle\, E(G_2)|$. In the vertex-adjacency model, a single vertex can cover all the edges
that touch it, and so the distance between the graphs $G_1 - U$ and $G_2 - U$ is precisely the
size of a minimum vertex cover of the difference graph. Denoting this vertex cover as
$VC(G_1 - U \,\triangle\, G_2 - U)$ we have that $d((G_1, \ell_1), (G_2, \ell_2)) = |U| + |VC(G_1 - U \,\triangle\, G_2 - U)|$.
It is evident that computing the distance between any two social networks in the
vertex-adjacency model is an NP-hard problem.
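
The two distance formulas can be evaluated directly on tiny instances. This is only a sketch: the (edge set, label dict) encoding and all names are our assumptions, and the vertex cover is found by exhaustive search, which is exponential and suitable for illustration only.

```python
from itertools import combinations


def min_vertex_cover_size(edges):
    """Brute-force minimum vertex cover; exponential time, tiny graphs only."""
    vertices = sorted({v for e in edges for v in e})
    for size in range(len(vertices) + 1):
        for cover in combinations(vertices, size):
            chosen = set(cover)
            if all(u in chosen or v in chosen for u, v in edges):
                return size


def distance(net1, net2, model):
    """Distance between two networks given as (edge_set, labels_dict)."""
    (e1, l1), (e2, l2) = net1, net2
    changed = {v for v in l1 if l1[v] != l2[v]}      # the set U
    if model == "edge":
        return len(changed) + len(e1 ^ e2)
    if model == "vertex":
        # Drop edges touching U, then cover what is left of the difference graph.
        diff = {e for e in e1 ^ e2 if not (set(e) & changed)}
        return len(changed) + min_vertex_cover_size(diff)
    raise ValueError("model must be 'edge' or 'vertex'")


labels = {v: "a" for v in range(5)}
empty = (frozenset(), labels)
star = (frozenset({(0, 1), (0, 2), (0, 3), (0, 4)}), labels)
doctor_star = (star[0], {**labels, 0: "doctor"})

star_edge = distance(empty, star, "edge")                   # 4 edge changes
star_vertex = distance(empty, star, "vertex")               # one vertex covers all
doctor_edge = distance(empty, doctor_star, "edge")          # 1 label + 4 edges
doctor_vertex = distance(empty, doctor_star, "vertex")      # only vertex 0 changes
```

The star example shows how far apart the two notions can be: a single vertex-adjacency step corresponds to four (in general up to $n-1$) edge-adjacency steps.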

To avoid cumbersome notation, throughout this entire chapter we omit the differentiation
between graphs and social networks, and denote networks as graphs $G \in \mathcal{G}$.


6.3  Restricted  Sensitivity 

We now introduce the notion of restricted sensitivity, using a hypothesis about the dataset
$D$ to restrict the sensitivity of a query. A hypothesis $\mathcal{H}$ is a subset of the set $\mathcal{D}$ of all
possible datasets (so in the context of social networks, $\mathcal{H}$ is a set of labeled graphs). We
say that $\mathcal{H}$ is true if the true dataset $D \in \mathcal{H}$. Because the hypothesis $\mathcal{H}$ may not be a convex
set we must consider all pairs of datasets in $\mathcal{H}$ instead of all pairs of adjacent datasets as
in the definition of global sensitivity.


Definition 6.4. For a given notion of adjacency among datasets, the restricted sensitivity
of $f$ over a hypothesis $\mathcal{H} \subset \mathcal{D}$ is

$$RS_f(\mathcal{H}) = \max_{D_1, D_2 \in \mathcal{H}} \frac{|f(D_1) - f(D_2)|}{d(D_1, D_2)}.$$

To be clear, $d(D_1, D_2)$ denotes the length of the shortest path in $\mathcal{D}$ between $D_1$ and
$D_2$ (not restricting the path to only use $D \in \mathcal{H}$) using the given notion of adjacency (e.g.,
edge-adjacency or vertex-adjacency). That is, we restrict the set of databases for which we
compute the sensitivity, but we do not re-define the distances.

Observe that $RS_f(\mathcal{H})$ may be smaller than $LS_f(D)$ for some $D \in \mathcal{H}$ if $D$ has a
neighbor $D' \notin \mathcal{H}$. In fact we often have $LS_f(D) \ge |f(D) - f(D')| \gg RS_f(\mathcal{H})$.
In such cases $RS_f(\mathcal{H})$ will be significantly lower than any $\beta$-smooth upper bound on
$LS_f(D)$. For example, consider the query $f$ defined as “how many people are friends
with a doctor” under the vertex adjacency model. For any social network with at least one
doctor there exists a neighboring network where $f$ takes the value $n$ - the one where the
doctor node connects to all other nodes. Therefore, even if the given network is such that
everyone has no more than $k$ friends, it still holds that the local sensitivity of $f$ is high.
In contrast, by hypothesizing that the given network is such that no one has more than $k$
friends, the restricted sensitivity of $f$ is $O(k)$. Further discussion is given in Section 6.5.


6.4  Using  Restricted  Sensitivity  to  Reduce  Noise 

To achieve differential privacy while adding noise proportional to $RS_f(\mathcal{H})$ we must be
willing to sacrifice accuracy guarantees for datasets $D \notin \mathcal{H}$. Our goal is to create a new
query $f_{\mathcal{H}}$ such that $f_{\mathcal{H}}(D) = f(D)$ for every $D \in \mathcal{H}$ ($f_{\mathcal{H}}$ is accurate when the hypothesis is
correct) and $f_{\mathcal{H}}$ either has low global sensitivity or low $\beta$-smooth sensitivity over datasets
$D \in \mathcal{H}$. (Again we emphasize: the global sensitivity of $f_{\mathcal{H}}$ is defined w.r.t. any pair
of neighboring datasets and not only neighboring datasets in $\mathcal{H}$. Similarly, the smooth
sensitivity of $f_{\mathcal{H}}(D)$ is w.r.t. any of the neighbors of $D$ regardless of whether they belong
to $\mathcal{H}$ or not.)

In this section, we first give an inefficient generic construction of such $f_{\mathcal{H}}$, showing
that it is always possible to devise $f_{\mathcal{H}}$ whose global sensitivity equals exactly the restricted
sensitivity of $f$ over $\mathcal{H}$. We then show how, for the case of social networks and for the
hypothesis $\mathcal{H}_k$ that the network has bounded degree, we can efficiently construct functions
$f_{\mathcal{H}_k}$ having approximately this property.


6.4.1  A  General  Construction 

We now show how, given $\mathcal{H}$, to generically (but not efficiently) construct $f_{\mathcal{H}}$ whose global
sensitivity exactly equals the restricted sensitivity of $f$ over $\mathcal{H}$. (The theorem we give can
be viewed as a special case of constructing a Lipschitz extension of $f|_{\mathcal{H}}$ over $\mathcal{D}$, i.e.,
mimicking either the Whitney or McShane extensions; see [94] for details.)

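Theorem 6.5. Given any query $f$ and any hypothesis $\mathcal{H} \subset \mathcal{D}$ we can construct a query
$f_{\mathcal{H}}$ such that

1. $\forall D \in \mathcal{H}$ it holds that $f_{\mathcal{H}}(D) = f(D)$, and

2. $GS_{f_{\mathcal{H}}} = RS_f(\mathcal{H})$.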


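Proof. For each $D \in \mathcal{H}$ set $f_{\mathcal{H}}(D) = f(D)$. Now fix an arbitrary ordering of the set
$\{D : D \notin \mathcal{H}\}$, and denote its elements as $D_1, D_2, \ldots, D_m$, where $m$ is the size of
the set. For every $D \notin \mathcal{H}$ we define the value of $f_{\mathcal{H}}(D)$ inductively. Denote
$\mathcal{T}_i = \mathcal{H} \cup \{D_1, \ldots, D_i\}$. Initially, we are given the values of $f_{\mathcal{H}}(D)$ for every $D \in \mathcal{T}_0$.
Given $i \ge 0$, we denote $\Delta_i = RS_{f_{\mathcal{H}}}(\mathcal{T}_i)$. We now prove one can pick the value $f_{\mathcal{H}}(D_{i+1})$
in a way that preserves the invariant that $\Delta_{i+1} = \Delta_i$. By applying the induction $m$ times
we conclude that

$$RS_f(\mathcal{H}) = \Delta_0 = \Delta_m = RS_{f_{\mathcal{H}}}(\mathcal{D}) = GS_{f_{\mathcal{H}}}.$$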


Fix $i \ge 0$. Observe that

$$\Delta_{i+1} = \max\left( \Delta_i,\ \max_{D \in \mathcal{T}_i} \frac{|f_{\mathcal{H}}(D) - f_{\mathcal{H}}(D_{i+1})|}{d(D, D_{i+1})} \right)$$

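so to preserve the invariant it suffices to find any value of $f_{\mathcal{H}}(D_{i+1})$ that satisfies that for
every $D \in \mathcal{T}_i$ we have $|f_{\mathcal{H}}(D) - f_{\mathcal{H}}(D_{i+1})| \le \Delta_i \cdot d(D, D_{i+1})$. Suppose for contradiction
that no such value exists. Then there must be two datasets $D_1^*, D_2^* \in \mathcal{T}_i$ with two intervals

$$\left[ f_{\mathcal{H}}(D_1^*) - \Delta_i \cdot d(D_1^*, D_{i+1}),\ f_{\mathcal{H}}(D_1^*) + \Delta_i \cdot d(D_1^*, D_{i+1}) \right]$$
$$\left[ f_{\mathcal{H}}(D_2^*) - \Delta_i \cdot d(D_2^*, D_{i+1}),\ f_{\mathcal{H}}(D_2^*) + \Delta_i \cdot d(D_2^*, D_{i+1}) \right]$$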
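
which don’t intersect. This would imply that

$$\frac{|f_{\mathcal{H}}(D_1^*) - f_{\mathcal{H}}(D_2^*)|}{d(D_1^*, D_2^*)} \ge \frac{|f_{\mathcal{H}}(D_1^*) - f_{\mathcal{H}}(D_2^*)|}{d(D_{i+1}, D_1^*) + d(D_{i+1}, D_2^*)} > \Delta_i$$

which contradicts the fact that $\Delta_i$ is the restricted sensitivity of $f_{\mathcal{H}}$ over $\mathcal{T}_i$. □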
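
For intuition, a Lipschitz extension of this kind can be computed explicitly on a toy metric space. The sketch below uses the closed-form McShane-style formula $f_{\mathcal{H}}(D) = \min_{D' \in \mathcal{H}} \big( f(D') + RS_f(\mathcal{H}) \cdot d(D, D') \big)$ rather than the inductive construction of the proof; the toy universe (datasets $0, \ldots, 9$ on a path, so $d(i,j) = |i-j|$), the hypothesis, and the query values are all our assumptions.

```python
def restricted_sensitivity(f, hypothesis, dist):
    """max |f(a) - f(b)| / d(a, b) over distinct pairs in the hypothesis."""
    pts = list(hypothesis)
    return max(abs(f[a] - f[b]) / dist(a, b)
               for i, a in enumerate(pts) for b in pts[i + 1:])


def mcshane_extension(f, hypothesis, dist, universe):
    """Extend f from the hypothesis to the whole universe, keeping its Lipschitz constant."""
    rs = restricted_sensitivity(f, hypothesis, dist)
    return {d: min(f[h] + rs * dist(d, h) for h in hypothesis) for d in universe}


def path_distance(a, b):
    """Toy metric: datasets are integers and neighbors differ by exactly 1."""
    return abs(a - b)


universe = range(10)
hypothesis = [0, 3, 9]
f = {0: 0.0, 3: 6.0, 9: 7.0}      # query values, known only on the hypothesis

f_h = mcshane_extension(f, hypothesis, path_distance, universe)
rs = restricted_sensitivity(f, hypothesis, path_distance)
# Global sensitivity of the extension: largest change between neighboring datasets.
gs = max(abs(f_h[i] - f_h[i + 1]) for i in range(9))
```

On this instance the extension agrees with $f$ on the hypothesis and its global sensitivity `gs` equals the restricted sensitivity `rs`, as Theorem 6.5 promises. Evaluating the formula requires a pass over all of $\mathcal{H}$ and exact distances, which is one way to see why such constructions are not efficient for graph hypotheses.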


6.4.2  Efficient  Procedures  for  Hk  via  Projection  Schemes 

Unfortunately, the construction of Theorem 6.5 is highly inefficient. Furthermore, this
construction deals with one query at a time. We would like to have, a priori, a way to
efficiently devise $f_{\mathcal{H}}$ for any $f$. In this section, the way we devise $f_{\mathcal{H}}$ is by constructing a
projection: a function $\mu : \mathcal{D} \to \mathcal{H}$ with the property that $\mu(D) = D$ for every $D \in \mathcal{H}$.
Such $\mu$ allows us to canonically convert any $f$ into $f_{\mathcal{H}}$ using the naive definition $f_{\mathcal{H}} = f \circ \mu$.
Below we discuss various properties of projections that allow us to derive “good” $f_{\mathcal{H}}$'s.
Following each property, we exhibit the existence of such projections $\mu$ for the specific
case of social networks and $\mathcal{H} = \mathcal{H}_k$, the class of graphs of degree at most $k$.

Definition 6.6. The class $\mathcal{H}_k$ is defined as the set $\{G \in \mathcal{G} : \forall v,\ \deg(v) \le k\}$.

In many labeled graphs, it is reasonable to believe that $\mathcal{H}_k$ holds for $k \ll n$ because the
degree distributions follow a power law. For example, the number of telephone numbers
receiving $t$ calls in a day is proportional to $1/t^2$, and the number of web pages with $t$
incoming links is proportional to $1/t^2$ [118, 46, 66]. For these networks it would suffice
to set $k = O(\sqrt{n})$. The number of papers that receive $t$ citations is proportional to $1/t^3$
so we could set $k = O(\sqrt[3]{n})$ [66]. While the degrees on Facebook don’t seem to follow
a power law, the upper bound $k = 5{,}000$ seems reasonable [134]. By contrast, Facebook
had approximately $n = 901{,}000{,}000$ users in June 2012 [1].


6.4.2.1  Smooth  Projection 

The first property we discuss is perhaps the simplest and most coveted property such a
projection can have: smoothness. Smoothness dictates that there exists a global bound on
the distance between the images of any two neighboring databases.

Definition 6.7. A projection $\mu : \mathcal{D} \to \mathcal{H}$ is called $c$-smooth if for any two neighboring
databases $D \sim D'$ we have that $d(\mu(D), \mu(D')) \le c$.

Lemma 6.8. Let $\mu : \mathcal{D} \to \mathcal{H}$ be a $c$-smooth projection (i.e., for every $D \in \mathcal{H}$ we have
$\mu(D) = D$). Then for every query $f$, the function $f_{\mathcal{H}} = f \circ \mu$ satisfies that
$GS_{f_{\mathcal{H}}} \le c \cdot RS_f(\mathcal{H})$.

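Proof.

$$\begin{aligned}
GS_{f_{\mathcal{H}}} &= \max_{D_1 \sim D_2} |f_{\mathcal{H}}(D_1) - f_{\mathcal{H}}(D_2)| \\
&= \max_{D_1 \sim D_2} |f(\mu(D_1)) - f(\mu(D_2))| \\
&\le \max_{D_1 \sim D_2} \frac{|f(\mu(D_1)) - f(\mu(D_2))|}{d(\mu(D_1), \mu(D_2))} \cdot c \\
&\le c \cdot \max_{D_1, D_2 \in \mathcal{H}} \frac{|f(D_1) - f(D_2)|}{d(D_1, D_2)} \\
&= c \cdot RS_f(\mathcal{H})
\end{aligned}$$

(Pairs with $\mu(D_1) = \mu(D_2)$ contribute $0$ to the maximum and can be ignored; for all other
pairs $1 \le d(\mu(D_1), \mu(D_2)) \le c$.) □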

As we now show, for $\mathcal{H} = \mathcal{H}_k$ and for distances defined via the edge-adjacency model,
we can devise an efficient smooth projection.

Claim 6.9. In the edge-adjacency model, there exists an efficiently computable 3-smooth
projection to $\mathcal{H}_k$.

Proof. We construct our smooth projection $\mu$ by first fixing a canonical ordering over all
possible edges. For each vertex $v$, let $e_1^v, \ldots, e_t^v$ denote the edges incident to $v$ in canonical
order. For each edge $e = \{u, v\}$ we delete $e$ if and only if (i) $e = e_j^v$ for $j > k$ or (ii) $e = e_j^u$
for $j > k$. (Intuitively, for each $v$ with $\deg(v) > k$ we keep the first $k$ edges incident to $v$ and
flag the other edges for deletion.) If $G \in \mathcal{H}_k$ then no edges are deleted, so $\mu(G) = G$. Suppose
that $G_1, G_2$ are neighbors differing on one edge $e = \{x, y\}$ (w.l.o.g., say that $e$ is in $G_1$).
Observe that for every $v \neq x, y$, the same set of edges incident to $v$ will be flagged in
both $G_1$ and $G_2$. At $x$, the presence of $e$ shifts the rank of every later edge by one, so besides
$e$ itself there may be at most one edge $e_x$ (incident to $x$) that is flagged in $G_1$ but not in $G_2$;
similarly there may be at most one such edge $e_y$ (incident to $y$). Therefore $\mu(G_1)$ and $\mu(G_2)$
differ in at most the three edges $e$, $e_x$, $e_y$.
Hence, $d(\mu(G_1), \mu(G_2)) \le 3$. □
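
The projection from the proof can be sketched in a few lines, together with an exhaustive check of 3-smoothness on very small graphs. This is an illustrative sketch rather than a definitive implementation: we assume unlabeled graphs encoded as sets of `(u, v)` tuples with `u < v`, use sorted tuple order as the canonical edge ordering, and all function names are hypothetical.

```python
from itertools import combinations


def project_to_degree_k(edges, k):
    """Keep an edge only if it is among the first k edges at both endpoints.

    Ranks are computed in the input graph: every edge ranked beyond k at
    either endpoint is flagged, and all flagged edges are removed at once.
    """
    seen = {}
    flagged = set()
    for e in sorted(edges):              # canonical order
        for v in e:
            seen[v] = seen.get(v, 0) + 1
            if seen[v] > k:
                flagged.add(e)
    return set(edges) - flagged


def max_degree(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return max(deg.values(), default=0)


def worst_case_smoothness(n, k):
    """Largest |mu(G1) ^ mu(G2)| over all n-node graph pairs differing in one edge."""
    pairs = list(combinations(range(n), 2))
    worst = 0
    for r in range(len(pairs) + 1):
        for chosen in combinations(pairs, r):
            g1 = set(chosen)
            for e in pairs:
                if e not in g1:
                    diff = project_to_degree_k(g1, k) ^ project_to_degree_k(g1 | {e}, k)
                    worst = max(worst, len(diff))
    return worst
```

Calling `worst_case_smoothness(5, 2)` enumerates all graphs on five vertices, so it is meant only as a sanity check of the bound in Claim 6.9; the projection itself runs in time near-linear in the number of edges (dominated by the sort).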

An  immediate  corollary  of  Lemma  6.8  and  Claim  6.9  is  the  following  theorem. 

Theorem 6.10. (Privacy w.r.t. Edge Changes) Given any query $f$ for social networks,
the mechanism that uses the projection $\mu$ from Claim 6.9, and answers the query using
$A(f, G) = f(\mu(G)) + \mathrm{Lap}(3 \cdot RS_f(\mathcal{H}_k) / \epsilon)$, preserves $(\epsilon, 0)$-differential privacy for any graph $G$.
Now, it is evident that this mechanism has the guarantee that for every $G \in \mathcal{H}_k$ it holds
that $\Pr[\,|A(f, G) - f(G)| \le O(RS_f(\mathcal{H}_k)/\epsilon)\,] \ge 2/3$. Furthermore, if the querier "lucked
out" and asked a query $f$ for which $f(G)$ and $f(\mu(G))$ are close (say, identical), then the same
guarantee holds for such $G$ as well. Note however that we cannot reveal to the querier
whether $f(G)$ and $f(\mu(G))$ are indeed close, as such information might leak privacy.
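As a rough illustration of the mechanism in Theorem 6.10, the sketch below adds Laplace noise of scale $c \cdot RS_f(\mathcal{H}_k)/\epsilon$ to the query evaluated on the projected graph. The names `private_answer`, `project` and `restricted_sensitivity` are hypothetical; the caller is assumed to supply a $c$-smooth projection and a valid (positive) bound on the restricted sensitivity.

```python
import random


def private_answer(f, project, graph, restricted_sensitivity, epsilon,
                   c=3.0, rng=random):
    """Return f(project(graph)) + Lap(c * RS / epsilon).

    Assumptions: `project` is a c-smooth projection (c = 3 in the edge
    model of Claim 6.9) and `restricted_sensitivity` is a positive upper
    bound on RS_f over the hypothesis class.
    """
    scale = c * restricted_sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return f(project(graph)) + noise
```

Since $\Pr[|\mathrm{Lap}(b)| \le b \ln 3] = 2/3$, the answer is within $O(RS_f(\mathcal{H}_k)/\epsilon)$ of $f(\mu(G))$ with probability $2/3$, matching the guarantee stated above.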

6.4.2.2 Projections and Smooth Distance Estimators

Unfortunately, smooth projections do not always exist, as the following toy example
demonstrates. Fix $n$ graphs, where $d(G_i, G_j) = |i - j|$ for $1 \le i, j \le n$, and let $\mathcal{H} =
\{G_1, G_n\}$. Because $\mu(G_1) = G_1$ and $\mu(G_n) = G_n$, there must exist some value $i$
such that $\mu(G_i) \ne \mu(G_{i+1})$. These two images are at distance $n - 1$ while $d(G_i, G_{i+1}) = 1$,
so no $\mu$ can be $c$-smooth for $c < n - 1$.

Note that $c$-smooth projections have the property that they also provide a $(c + 1)$-approximation
of the distance of $D$ to $\mathcal{H}$. Meaning, for every $D$ we have that $d(D, \mathcal{H}) \le
d(D, \mu(D)) \le (c + 1) \cdot d(D, \mathcal{H})$. In the vertex adjacency model however, it is evident that
we cannot have an $O(1)$-smooth projection since it is NP-hard to approximate $d(G, \mathcal{H}_k)$.

Claim 6.11. (Privacy wrt Vertex Adjacency) Unless P = NP there is no efficiently computable
mapping $\mu : \mathcal{G} \to \mathcal{H}_k$ such that

1. $\forall G \in \mathcal{H}_k$, $\mu(G) = G$.

2. $\forall G \in \mathcal{G}$, $d(G, \mu(G)) \le O(\ln(k) \cdot d(G, \mathcal{H}_k))$.

Proof (Sketch). Our reduction is from the minimum set cover problem. It is NP-hard to
approximate the minimum set cover problem to a factor better than $O(\log n)$ [126, 9].
Given a set cover instance with sets $S_1, \dots, S_m$ and universe $U = \{x_1, \dots, x_n\}$ we set

$m_i = |\{j : x_i \in S_j\}|$ and $k = n + 1$. We construct our labeled graph $G$ as follows:

1. Add a node for each $S_i$.

2. Add a node for each $x_j$.

3. Add the edge $\{x_j, S_i\}$ if and only if $x_j \in S_i$.

4. For each $x_i$, create $k + 1 - m_i$ fresh nodes $y^i_1, \dots, y^i_{k+1-m_i}$ and add each edge $\{y^i_j, x_i\}$.

Intuitively each node $x_i$ has $k + 1$ incident edges. By deleting all of the edges incident to
the node $S_i$ we can fix all of the nodes $x \in S_i$. Hence, $d(G, \mathcal{H}_k)$ corresponds exactly to
the size of the minimum set cover. □
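The graph built in this reduction can be sketched as follows. The function name `set_cover_to_graph` and the tuple-based vertex names are hypothetical, and `k` is passed in explicitly rather than derived from the instance.

```python
def set_cover_to_graph(universe, sets, k):
    """Build the edge set of the reduction graph (illustrative sketch).

    Vertices are tagged tuples: ("S", i) for the i-th set, ("x", x) for an
    element x, and ("y", x, j) for the j-th padding node attached to x.
    Every element node ends up with exactly k + 1 incident edges.
    """
    edges = set()
    for x in universe:
        containing = [i for i, s in enumerate(sets) if x in s]
        padding = k + 1 - len(containing)
        if padding < 0:
            raise ValueError("element belongs to more than k + 1 sets")
        for i in containing:
            edges.add((("x", x), ("S", i)))
        for j in range(padding):
            edges.add((("x", x), ("y", x, j)))
    return edges
```

Isolating the set nodes of a cover removes at least one edge from every element node, which brings each of them down to degree at most k.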
Though computing $d(G, \mathcal{H}_k)$ is hard, we can approximate it.

Claim 6.12. (Privacy wrt Vertex Adjacency) There is an efficiently computable projection
$\mu : \mathcal{G} \to \mathcal{H}_k$ such that for every $G \in \mathcal{G}$ it holds that $d(G, \mu(G)) \le \ln(2d^2 + kd) \cdot d(G, \mathcal{H}_k)$, where $d = d(G, \mathcal{H}_k)$.

Proof. First, we define the potential of a graph $G$ as follows

$$\Phi(G) = \sum_{v \in G :\, \deg(v) > k} (\deg(v) - k).$$

We give an algorithm to construct $\mu$. The algorithm traverses all options for $d(G, \mathcal{H}_k)$.
For every $d \in \{1, 2, \dots, n - k\}$ we do the following: (1) as long as there exist nodes with
degree $\ge k + d + 1$ we arbitrarily pick one such node $v$ and remove all edges touching $v$;
(2) as long as there are nodes with degree $\ge k + 1$ we greedily pick a vertex $v$ maximizing
$\Phi(G) - \Phi(G - v)$ and remove all edges incident to $v$ (we use $G - v$ to denote the graph
where we remove all edges incident to $v$). We define $\mu(G)$ as the result of the iteration
touching the minimum number of vertices.
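The two-phase greedy procedure just described can be sketched in code. This is an illustrative, unoptimized rendering: the adjacency-dictionary representation and the names `potential`, `greedy_for_guess` and `project_vertex_model` are assumptions, and ties in the greedy step are broken arbitrarily.

```python
def potential(adj, k):
    """Sum of (deg(v) - k) over all vertices with deg(v) > k."""
    return sum(len(nbrs) - k for nbrs in adj.values() if len(nbrs) > k)


def _isolate(adj, v):
    """Remove every edge incident to v (in place)."""
    for u in adj[v]:
        adj[u].discard(v)
    adj[v] = set()


def greedy_for_guess(adj, k, d):
    """Run steps (1) and (2) for one guess d; return (graph, touched)."""
    adj = {v: set(n) for v, n in adj.items()}
    touched = []
    # Step (1): any vertex of degree >= k + d + 1 is isolated outright.
    while True:
        heavy = [v for v in adj if len(adj[v]) >= k + d + 1]
        if not heavy:
            break
        _isolate(adj, heavy[0])
        touched.append(heavy[0])
    # Step (2): greedily isolate the vertex with the largest potential drop.
    while potential(adj, k) > 0:
        def gain(v):
            trial = {u: set(n) for u, n in adj.items()}
            _isolate(trial, v)
            return potential(adj, k) - potential(trial, k)
        best = max(adj, key=gain)
        _isolate(adj, best)
        touched.append(best)
    return adj, touched


def project_vertex_model(adj, k):
    """Try every guess d and keep the run touching the fewest vertices."""
    if potential(adj, k) == 0:
        return {v: set(n) for v, n in adj.items()}, []
    runs = [greedy_for_guess(adj, k, d) for d in range(1, len(adj) - k + 1)]
    return min(runs, key=lambda run: len(run[1]))
```

On a star with five leaves and k = 2, every guess ends up isolating only the center, so the projection touches a single vertex.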

To analyze this algorithm, we consider the output of the algorithm for the correct guess
$d = d(G, \mathcal{H}_k)$. Pick a path of neighboring graphs from $G$ to $\mathcal{H}_k$ of length $d$, and label
each pair of neighboring graphs by the vertex that changes between the two. It is evident
that the path labels must contain every node of degree $\ge k + d + 1$, since any other path of
length $d$ may reduce the degree of such nodes only to $k + 1$.

So now, denote $\Phi_0$ as the potential of $G$ after step (1), and denote the set of $t$ nodes
we pick in step (2) as $v_1, v_2, \dots, v_t$, and let $\Phi_i$ denote the potential after removing all edges
touching $v_i$. Observe that $\Phi_0 \le 2d^2 + kd$, because there exists a set of $\le d$ nodes such that
by removing all edges incident to them we convert the graph to a graph in $\mathcal{H}_k$, and each
of these nodes decreases the potential by no more than $\deg(v) + \deg(v) - k \le (d + k) + d = 2d + k$.
By standard greedy analysis, every time we pick a vertex, there
always exists some vertex that reduces the potential by a factor of $1 - \frac{1}{d}$. Therefore, after
$t = d \ln(2d^2 + kd)$ iterations we have that $\Phi_t \le (1 - \frac{1}{d})^t \Phi_0 < 1$. □

The projection of Claim 6.12 might be used to produce a function $f_{\mathcal{H}}(G) = f(\mu(G))$
with low smooth sensitivity over the nice graphs $\mathcal{H}_k$. Unfortunately, we don't know of any
efficient algorithm to compute the smooth upper bound for such $f_{\mathcal{H}}$. Yet, we show that it
is possible to devise a somewhat relaxed projection s.t. the distance between a database
and its mapped image is a smooth function. To that end, we slightly relax the definition of
projection, allowing it to map instances to some predefined $\bar{\mathcal{H}} \supset \mathcal{H}$.
Definition 6.13. Fix $\bar{\mathcal{H}} \supset \mathcal{H}$. Let $\mu$ be a projection of $\mathcal{H}$, so $\mu$ is a mapping $\mu : \mathcal{D} \to \bar{\mathcal{H}}$
that maps every element of $\mathcal{H}$ to itself ($\forall D \in \mathcal{H}$ we have that $\mu(D) = D$). A $c$-smooth
distance estimator is a function $\hat{d}_\mu : \mathcal{D} \to \mathbb{R}$ that satisfies all of the following. (1) For
every $D \in \mathcal{H}$ it is defined as $\hat{d}_\mu(D) = 0$. (2) It is lower bounded by the distance of $D$ to
its projection: $\forall D \in \mathcal{D}$, $\hat{d}_\mu(D) \ge d(D, \mu(D))$. (3) Its value over neighboring databases
changes by at most $c$: $\forall D \sim D'$, $|\hat{d}_\mu(D) - \hat{d}_\mu(D')| \le c$.


It is simple to verify that for every $D \in \mathcal{D}$ we have that $\hat{d}_\mu(D) \le c \cdot d(D, \mathcal{H})$ (using
induction on $d(D, \mathcal{H})$). We omit the subscript when $\mu$ is specified.

The following lemma suggests that a smooth distance estimator allows us to devise a
good smooth upper bound on the local sensitivity, thus allowing us to apply the smooth-sensitivity
scheme of [119].


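Lemma 6.14. Fix $\bar{\mathcal{H}} \supset \mathcal{H}$ and let $\mu : \mathcal{D} \to \bar{\mathcal{H}}$ be a projection of $\mathcal{H}$. Let $\hat{d} : \mathcal{D} \to \mathbb{R}$ be an
efficiently computable $c$-smooth distance estimator. Then for every query $f$, we can define
the composition $f_{\mathcal{H}} = f \circ \mu$ and define the function

$$S_{f_{\mathcal{H}},\beta}(D) = \max_{d \in \mathbb{Z},\, d \ge \hat{d}(D)} e^{-\frac{\beta}{c}(d - \hat{d}(D))} (2d + c + 1) \cdot RS_f(\bar{\mathcal{H}}).$$

Then $S_{f_{\mathcal{H}},\beta}$ is an efficiently computable $\beta$-smooth upper bound on the local sensitivity of
$f_{\mathcal{H}}$. Furthermore, define $g$ as the function

$$g(x) = \begin{cases} \frac{2}{x}\, e^{-1 + \frac{c+1}{2} x}, & 0 < x \le \frac{2}{c+1} \\ c + 1, & x > \frac{2}{c+1} \end{cases}$$

Then for every $D$ it holds that

$$S_{f_{\mathcal{H}},\beta}(D) \le \exp\left(\tfrac{\beta}{c}\, \hat{d}(D)\right) \cdot g(\beta/c) \cdot RS_f(\bar{\mathcal{H}}).$$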


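Proof. First, we show that indeed $S_{f_{\mathcal{H}},\beta}$ is an upper bound on the local sensitivity of $f_{\mathcal{H}}$.
Fix any $D \in \mathcal{D}$ and indeed

$$
\begin{aligned}
LS_{f_{\mathcal{H}}}(D) &= \max_{D' \sim D} |f_{\mathcal{H}}(D) - f_{\mathcal{H}}(D')| \\
&= \max_{D' \sim D} |f(\mu(D)) - f(\mu(D'))| \\
&\le \max_{D' \sim D} RS_f(\bar{\mathcal{H}}) \cdot d(\mu(D), \mu(D')) \\
&\le \max_{D' \sim D} RS_f(\bar{\mathcal{H}}) \cdot \left( d(D, \mu(D)) + d(D, D') + d(D', \mu(D')) \right) \\
&\le RS_f(\bar{\mathcal{H}}) \cdot \left( \hat{d}(D) + 1 + \max_{D' \sim D} \hat{d}(D') \right) \\
&\le RS_f(\bar{\mathcal{H}}) \cdot (2\hat{d}(D) + c + 1) \\
&\le \max_{d \ge \hat{d}(D)} e^{-\frac{\beta}{c}(d - \hat{d}(D))} (2d + c + 1)\, RS_f(\bar{\mathcal{H}}) \\
&= S_{f_{\mathcal{H}},\beta}(D).
\end{aligned}
$$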

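Next we prove that $S_{f_{\mathcal{H}},\beta}$ is $\beta$-smooth. Let $D_1$ and $D_2$ be two neighboring databases,
and wlog assume $\hat{d}(D_2) \le \hat{d}(D_1)$. Then

$$\frac{S_{f_{\mathcal{H}},\beta}(D_1)}{S_{f_{\mathcal{H}},\beta}(D_2)} = \frac{\max_{d \ge \hat{d}(D_1)} e^{-\frac{\beta}{c}(d - \hat{d}(D_1))} (2d + c + 1)\, RS_f(\bar{\mathcal{H}})}{\max_{d \ge \hat{d}(D_2)} e^{-\frac{\beta}{c}(d - \hat{d}(D_2))} (2d + c + 1)\, RS_f(\bar{\mathcal{H}})}$$

Let $d_0$ be the value of $d$ on which the maximum of the numerator is obtained. Then

$$
\begin{aligned}
\frac{S_{f_{\mathcal{H}},\beta}(D_1)}{S_{f_{\mathcal{H}},\beta}(D_2)}
&= \frac{\exp\left(-\frac{\beta}{c}(d_0 - \hat{d}(D_1))\right) (2d_0 + c + 1)\, RS_f(\bar{\mathcal{H}})}{\max_{d \ge \hat{d}(D_2)} \exp\left(-\frac{\beta}{c}(d - \hat{d}(D_2))\right) (2d + c + 1)\, RS_f(\bar{\mathcal{H}})} \\
&\le \frac{\exp\left(-\frac{\beta}{c}(d_0 - \hat{d}(D_1))\right) (2d_0 + c + 1)\, RS_f(\bar{\mathcal{H}})}{\exp\left(-\frac{\beta}{c}(d_0 - \hat{d}(D_2))\right) (2d_0 + c + 1)\, RS_f(\bar{\mathcal{H}})} \\
&= \exp\left(-\tfrac{\beta}{c}(\hat{d}(D_2) - \hat{d}(D_1))\right) \le \exp(\beta)
\end{aligned}
$$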




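where the last inequality uses the smoothness property, i.e. that $\hat{d}(D_2) - \hat{d}(D_1) \ge -c$.
Finally, we wish to prove the global upper bound on $S_{f_{\mathcal{H}},\beta}$, i.e., that for every $D \in \mathcal{D}$

$$S_{f_{\mathcal{H}},\beta}(D) \le \exp\left(\tfrac{\beta}{c}\, \hat{d}(D)\right) \cdot g(\beta/c) \cdot RS_f(\bar{\mathcal{H}}).$$

Fix $D$ and define $h(x) = \exp\left(-\frac{\beta}{c} x\right)(2x + c + 1)$, so that
$S_{f_{\mathcal{H}},\beta}(D) = \exp\left(\frac{\beta}{c}\, \hat{d}(D)\right) RS_f(\bar{\mathcal{H}}) \cdot \max_{d \ge \hat{d}(D)} h(d)$. Taking the derivative of $h$ we have

$$h'(x) = e^{-\frac{\beta}{c} x} \left( -2x\tfrac{\beta}{c} - \beta - \tfrac{\beta}{c} + 2 \right)$$

which means that $h(x)$ is maximized at $x_0 = \frac{c}{\beta} - \frac{c+1}{2}$. In the case that $x_0 < 0$ (i.e.
for $\beta/c > \frac{2}{c+1}$) we can upper bound the function $h(x)$ with $h(0) = c + 1$ for every
$x \ge 0$. Otherwise, we have that $h(x) \le h(x_0)$ for every $x \ge 0$, and indeed

$$h(x_0) = \frac{2c}{\beta}\, e^{-1 + \frac{\beta}{c} \cdot \frac{c+1}{2}} = g(\beta/c).$$

To conclude the proof, observe that computing $S_{f_{\mathcal{H}},\beta}(D)$ is just a simple optimization
once $\hat{d}(D)$ is known, much like the derivation done above. So since $\hat{d}$ is efficiently
computable, we have that $S_{f_{\mathcal{H}},\beta}$ is efficiently computable. □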
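The "simple optimization" mentioned at the end of the proof can be sketched as follows, reading the bound as $\max_{d \in \mathbb{Z},\, d \ge \hat{d}(D)} e^{-\frac{\beta}{c}(d - \hat{d}(D))}(2d + c + 1) \cdot RS_f$ and $g$ as a function of $\beta/c$. The function names and argument order are illustrative assumptions. Because $h(x) = e^{-\beta x / c}(2x + c + 1)$ increases up to $x_0 = c/\beta - (c+1)/2$ and decreases afterwards, only the integers adjacent to $x_0$ and the smallest admissible integer need to be checked.

```python
import math


def smooth_upper_bound(d_hat, beta, c, rs):
    """Evaluate the smooth bound of Lemma 6.14 for estimator value d_hat."""
    def term(d):
        return math.exp(-(beta / c) * (d - d_hat)) * (2 * d + c + 1) * rs

    lo = math.ceil(d_hat)              # smallest integer d >= d_hat
    x0 = c / beta - (c + 1) / 2        # unconstrained maximizer of h
    candidates = {lo, max(lo, math.floor(x0)), max(lo, math.ceil(x0))}
    return max(term(d) for d in candidates)


def g(x, c):
    """Closed-form value of max_{x' >= 0} h(x') as a function of x = beta/c."""
    if x > 2 / (c + 1):
        return c + 1
    return (2 / x) * math.exp(-1 + x * (c + 1) / 2)
```

For a non-negative estimator value, `smooth_upper_bound(d_hat, beta, c, rs)` never exceeds `math.exp(beta / c * d_hat) * g(beta / c, c) * rs`, which is the global bound stated in the lemma.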

Like in the edge-adjacency model, we now exhibit a projection and a smooth distance
estimator for the vertex-adjacency model.

Claim 6.15. In the vertex-adjacency model, there exists a projection $\mu : \mathcal{G} \to \mathcal{H}_{2k}$ and a
4-smooth distance estimator $\hat{d}$, both of which are efficiently computable.

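Proof. To construct $\mu$ and $\hat{d}$ we start with the linear program that determines a "fractional
distance" from a graph to $\mathcal{H}_k$. This LP has $n + \binom{n}{2}$ variables: $x_u$, which intuitively represents
whether the vertex $u$ ought to be removed from the graph or not, and $w_{u,v}$, which represents
whether the edge between $u$ and $v$ remains in the projected graph or not. We also use the
notation $a_{u,v}$, where $a_{u,v} = 1$ if the edge $\{u, v\}$ is in $G$; otherwise $a_{u,v} = 0$.

$$
\begin{aligned}
\min \quad & \sum_v x_v \quad \text{s.t.} \\
(1) \quad & \forall v, \; x_v \ge 0 \\
(2) \quad & \forall u, v, \; w_{u,v} \ge 0 \\
(3) \quad & \forall u, v, \; a_{u,v} \ge w_{u,v} \ge a_{u,v} - x_u - x_v \\
(4) \quad & \forall u, \; \sum_v w_{u,v} \le k
\end{aligned}
$$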
To convert our fractional solution $(x^*, w^*)$ to a graph $\mu(G) \in \mathcal{H}_{2k}$ we define $\mu(G)$ to be
the graph we get by removing every edge $(u, v) \in E(G)$ for which either endpoint has weight
$x^*_u \ge 1/4$ or $x^*_v \ge 1/4$. We define our distance estimator as $\hat{d}(G) = 4 \sum_u x^*_u$.
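The rounding step can be sketched as below. Solving the LP itself is left out: the sketch assumes a fractional solution $x^*$ has already been computed by some LP solver and is handed over as a dictionary. The name `round_lp_solution` and the dictionary representation are hypothetical.

```python
def round_lp_solution(edges, x_star, threshold=0.25):
    """Round a fractional LP solution into a projected graph and an estimate.

    `edges` is a set of 2-tuples; `x_star` maps each vertex to its LP
    weight (vertices missing from the dictionary are treated as weight 0).
    Returns (kept_edges, distance_estimate).
    """
    kept = {
        (u, v) for (u, v) in edges
        if x_star.get(u, 0.0) < threshold and x_star.get(v, 0.0) < threshold
    }
    # The estimator is four times the fractional objective value.
    estimate = 4 * sum(x_star.values())
    return kept, estimate
```

For a star with five leaves and k = 2, the fractional solution that puts weight 0.6 on the center is feasible; rounding it removes all five edges and yields an estimate of 2.4, which upper bounds the vertex-model distance of 1 between the star and its projection.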

We need to show that $\mu$ and $\hat{d}$ satisfy the conditions of Claim 6.15. We first prove that
$\mu$ is a projection mapping every graph to a graph in $\mathcal{H}_{2k}$. Suppose that some $v \in G$ has
degree $\ge 2k$, then clearly $x^*_v < 1/4$, for otherwise we would have removed all of the
edges touching $v$. Observe that every edge we keep has $w^*_{u,v} \ge 1 - 1/4 - 1/4 = 1/2$.
Consequently, we can have at most $2k$ edges with $w_{u,v} \ge \frac{1}{2}$ because of the constraint
$\sum_u w_{u,v} \le k$. So there are at most $2k$ edges incident to $v$ in $\mu(G)$.

Now, let us prove that $\hat{d}$ satisfies all of the requirements of a 4-smooth distance estimator.
First, if $G \in \mathcal{H}_k$ then the optimal solution of the LP is the all zero vector, so $\hat{d}(G) = 0$
for all graphs of max-degree $\le k$. Secondly, observe that in the process of computing
$\mu(G)$, every edge that is removed from $G$ can be "charged" to a vertex $v$ with $x^*_v \ge 1/4$.
It follows that

$$d(G, \mu(G)) \le \sum_{v :\, x^*_v \ge 1/4} 1 \le \sum_{v :\, x^*_v \ge 1/4} 4x^*_v \le 4 \sum_v x^*_v.$$

Lastly, fix any neighboring $G_1, G_2 \in \mathcal{G}$, and let $v$ be the vertex whose edges differ in $G_1$
and $G_2$. If $x^*$ is a solution for $LP(G_1)$, then we set $y_v = 1$ and
$y_u = x^*_u$ for every $u \ne v$. Now $y$ is a feasible (not necessarily optimal) solution to $LP(G_2)$. It
is simple to infer that

$$
\begin{aligned}
\hat{d}(G_2) - \hat{d}(G_1) &= \hat{d}(G_2) - 4 \sum_u x^*_u \\
&\le 4 \sum_u y_u - 4 \sum_u x^*_u \le 4 \sum_u |y_u - x^*_u| \\
&= 4 |y_v - x^*_v| \le 4
\end{aligned}
$$

□


As  before,  combining  Lemma  6.14  with  Claim  6.15  gives  the  following  theorem  as  an 
immediate  corollary. 

Theorem 6.16. (Privacy wrt Vertex Adjacency) Given any query for social networks $f$, the
mechanism that uses the projection $\mu$ from Claim 6.15 and the $\beta$-smooth upper bound of
Lemma 6.14, and answers the query using $A(f, G) = f(\mu(G)) + \mathrm{Lap}(2 \cdot S_{f_{\mathcal{H}},\, -\epsilon/(2 \ln \delta)}(G) / \epsilon)$
preserves $(\epsilon, \delta)$ privacy for any graph $G$.

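Again, it is evident from the definition that the algorithm has the guarantee that for
every $G \in \mathcal{H}_k$ it holds that $\Pr\left[\,|A(f, G) - f(G)| \le O\left(g\left(\tfrac{\epsilon}{8 \ln(1/\delta)}\right) RS_f(\mathcal{H}_{2k}) / \epsilon\right)\right] \ge 2/3$.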
6.5 Restricted Sensitivity and $\mathcal{H}_k$


Now that we have constructed the machinery of restricted sensitivity, we compare the
restricted sensitivity over $\mathcal{H}_k$ with smooth sensitivity for specific types of queries, in order
to demonstrate the benefits of our approach. In a nutshell, restricted sensitivity offers a
significant advantage over smooth sensitivity whenever $k \ll n$. I.e., we show that there
are queries $f$ s.t. for some $G \in \mathcal{H}_k$ it holds that $RS_f(\mathcal{H}_k) \ll S_{f,\beta}(G)$. We define two
types of queries: local profile queries and subgraph counting queries.

6.5.1  Local  Profile  Queries 

First, let us introduce some notation. A profile is a function that maps a vertex $v$ in a
social network $(G, \ell)$ to $[0, 1]$. Given a set of vertices $\{v_1, v_2, \dots, v_t\}$, we denote by
$G[v_1, \dots, v_t]$ the social network derived by restricting $G$ and $\ell$ to these $t$ vertices.
We use $G_v = G[\{v\} \cup \{w \mid (v, w) \in E(G)\}]$ to denote the social network derived
by restricting $G$ and $\ell$ to $v$ and its neighbors. A local profile satisfies the constraint
$p(v, (G, \ell)) = p(v, G_v)$.

Definition 6.17. A (local) profile query

$$f_p(G, \ell) = \sum_{v \in V(G)} p(v, (G, \ell))$$

sums the (local) profile $p$ across all nodes.

Indeed, local profile queries are our motivating example. They capture a class of questions
often asked in social networks, such as: "how many people are doctors?", "how many
people are friends with t doctors?", "how many people are friends with at least 7 doctors
who are all friends with each other?", etc.
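A local profile query of this kind can be sketched in a few lines. The representation (an adjacency dictionary plus a label dictionary) and the names `local_profile_query` and `friends_with_at_least` are hypothetical choices made for illustration; the profile only ever sees the vertex, its neighbors, and the edges among them.

```python
def local_profile_query(adj, labels, profile):
    """Sum a local profile over all vertices of a labeled graph."""
    total = 0.0
    for v in adj:
        members = {v} | adj[v]
        # Restrict the graph and the labeling to v and its neighbors.
        sub_adj = {u: adj[u] & members for u in members}
        sub_labels = {u: labels[u] for u in members}
        total += profile(v, sub_adj, sub_labels)
    return total


def friends_with_at_least(t, label):
    """Profile that is 1 if v has at least t neighbors carrying `label`."""
    def profile(v, sub_adj, sub_labels):
        matches = sum(1 for u in sub_adj[v] if sub_labels[u] == label)
        return 1.0 if matches >= t else 0.0
    return profile
```

On the path a, b, c in which a and c are labeled "doctor", the query built from `friends_with_at_least(2, "doctor")` counts only the middle vertex b.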

Claim  6.18  bounds  the  restricted  sensitivity  of  a  local  profile  query  over  Hk  (e.g.,  in 
the  vertex  adjacency  model  a  node  v  can  at  worst  affect  the  local  profiles  of  itself,  its  k  old 
neighbors  and  its  k  new  neighbors). 

Claim 6.18. For any local profile query $f$, we have that $RS_f(\mathcal{H}_k) \le 2k + 1$ in the vertex
adjacency model, and $RS_f(\mathcal{H}_k) \le k + 1$ in the edge adjacency model.

Proof.  Consider  a  local  profile  query  fp. 

(Label change) Let $G_1, G_2 \in \mathcal{H}_k$ be two graphs with the same exact edge set, but with
labeling functions $\ell_1, \ell_2$ that are different on a single vertex. Let $v$ be the vertex whose label
differs on $G_1$ and $G_2$, and let $N_v$ denote the set of its (at most $k$) neighbors. Then for every
$u \notin \{v\} \cup N_v$ we have that $p(u, (G_1, \ell_1)) = p(u, (G_2, \ell_2))$. Hence, $|f_p(G_1) - f_p(G_2)| \le
|\{v\} \cup N_v| \le k + 1$.

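(Vertex Adjacency) Let $G_1, G_2 \in \mathcal{H}_k$ be any two neighboring labeled graphs such that
$G_1 - v = G_2 - v$. Let $N^1_v$ (resp. $N^2_v$) denote the neighborhood of $v$ in $G_1$ (resp. $G_2$); then for any
$y \notin N^1_v \cup N^2_v \cup \{v\}$ we have that $p(y, (G_1, \ell_1)) = p(y, (G_2, \ell_2))$. Hence, $|f_p(G_1) - f_p(G_2)| \le
|N^1_v \cup N^2_v \cup \{v\}| \le 2k + 1$.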

(Edge Adjacency) Let $G_1, G_2 \in \mathcal{H}_k$ be any two neighboring labeled graphs. Wlog, there
is an edge $e = \{u, v\}$ such that $E(G_1) = E(G_2) \cup \{e\}$. In order to have a vertex $y$
s.t. $p(y, (G_1, \ell_1)) \ne p(y, (G_2, \ell_2))$ we need that the edge $e$ appears in the graph we get
by restricting the social network to the set of $y$ and its neighbors. It follows that the only
vertices whose local profile can change are in the union $\{u, v\} \cup (N_u \cap N_v)$. Hence,

$|f(G_1) - f(G_2)| \le |\{u\} \cup \{v\}| + |N_u \cap N_v| \le 2 + k - 1 = k + 1$. □

By contrast, the smooth sensitivity of a local profile query may be as large as $O(n)$ even
for graphs in $\mathcal{H}_k$. Consider the local profile query "how many people are friends with a
spy?" In the vertex-adjacency model, the $(n - 1)$-star graph $G_1$ in which a spy $v$ is friends
with everyone is adjacent to the empty graph $G_0 \in \mathcal{H}_k$. Therefore, any smooth upper
bound $S_{f,\beta}$ must have $S_{f,\beta}(G_0) \ge n - 1$. It is also worth observing that the assumption
$G \in \mathcal{H}_k$ does not necessarily shrink the range of possible answers to a local profile query
$f$ (e.g., there are graphs $G \in \mathcal{H}_k$ in which everyone is friends with a spy).

We comment that local profile queries are a natural extension of predicates to social
networks, which can be used to study many interesting properties of a social network
(like clustering coefficients, local bridges and 2-betweenness). The clustering coefficient
$c(v)$ [142, 118] of a node $v$ (e.g., the probability that two randomly selected friends of $v$
are friends with each other) has been used to identify teenage girls who are more likely
to consider suicide [34]. One explanation is that it becomes an inherent source of stress
if a person has many friends who are not friends with each other [66]. Observe that $c(v)$
is a local profile. An edge $\{v, w\}$ is a local bridge if its endpoints have no friends
in common. A local profile could score a vertex $v$ based on the number of local bridges
incident to $v$. A marketing agency may be interested in identifying nodes that are incident
to many local bridges because local bridges "provide their endpoints with access to parts
of the network - and hence sources of information - that they would otherwise be far
away from [66]." For example, a 1995 study showed that the best job leads often come
from acquaintances rather than close friends [75]. 2-betweenness (a variant of betweenness
[70]) measures the centrality of a node. We say that the 2-betweenness of a vertex $v$ is the
probability that a randomly chosen shortest path between $x, y \in G_v$, two randomly
chosen neighbors of $v$, goes through $v$.
6.5.2 Subgraph Counting Queries


Subgraph queries allow us to ask questions such as "how many triplets of people are all
friends when two of them are doctors and the other is a pop-singer?" or "how many paths
of length 2 are there connecting a spy and a pop-singer over 40?" The average clustering
coefficient of a graph can be computed from the number of triangles and 2-stars in a graph.

Definition 6.19. A subgraph counting query $f = \langle H, \bar{p} \rangle$ is given by a connected graph $H$
over $t$ vertices and $t$ predicates $p_1, p_2, \dots, p_t$. Given a social network $(G, \ell)$, the answer
to $f(G, \ell)$ is the size of the set

$$\{ v_1, v_2, \dots, v_t : G[v_1, v_2, \dots, v_t] = H \text{ and } \forall i, \; \ell(v_i) \in p_i \}$$
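For the special case $H = K_3$, a subgraph counting query can be sketched as below. The sketch adopts one possible reading of the definition, counting each unordered vertex triple once if some assignment of the three predicates to its vertices is satisfied; that reading, the predicate-as-function representation, and the name `count_labeled_triangles` are all assumptions.

```python
from itertools import combinations, permutations


def count_labeled_triangles(adj, labels, predicates):
    """Count triangles whose vertices can be matched to the 3 predicates.

    `adj` maps each (comparable) vertex id to its neighbor set, `labels`
    maps vertices to labels, and `predicates` is a list of three functions
    from a label to a bool.
    """
    count = 0
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            if any(
                all(p(labels[v]) for p, v in zip(predicates, order))
                for order in permutations((a, b, c))
            ):
                count += 1
    return count
```

With two "is a doctor" predicates and one "is a pop-singer" predicate this answers the first example question above; with three identically true predicates it simply counts triangles.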


The smooth sensitivity of a subgraph counting query may be as high as $O(n^{t-1})$ in the
vertex adjacency model. Let $f = \langle H, \bar{p} \rangle$ be a subgraph counting query where $H$ is a $t$-star
and each predicate $p_i$ is identically true. Let $G_1$ be an $n$-star ($f(G_1) = \binom{n-1}{t-1}$). Then in
the vertex adjacency model there is a neighboring graph $G_2$ with no edges ($f(G_2) = 0$).
We have that $LS_f(G_2) \ge \binom{n-1}{t-1}$. Observe that $G_2 \in \mathcal{H}_k$. We now show that the smooth
sensitivity of $f = \langle K_3, \bar{p} \rangle$ is always at least $e^{-2\beta}(n - 2)$ when each predicate $p_i$ is identically
true.

Claim 6.20. Let $f = \langle K_3, \bar{p} \rangle$ be a subgraph counting query with predicates $p_i$ that are
identically true. In the vertex adjacency model, for any $\beta$-smooth upper bound $S_{f,\beta}$ on the local
sensitivity of $f$ and any graph $G$ we have $S_{f,\beta}(G) \ge \exp(-2\beta)(n - 2)$.

Proof. Let $G$ be given. Pick $v_1, v_2 \in V(G)$ and let $G_1$ be obtained from $G$ by adding all
possible edges incident to $v_1$, and let $G_2$ be obtained from $G_1$ by deleting all edges incident
to $v_2$. Finally, let $G_3$ be obtained from $G_2$ by adding all possible edges incident to $v_2$. Now
the local sensitivity of $f$ at $G_2$ is at least $n - 2$,

$$
\begin{aligned}
LS_f(G_2) &= \max_{G' :\, d(G_2, G') = 1} |f(G_2) - f(G')| \\
&\ge f(G_3) - f(G_2) \ge n - 2
\end{aligned}
$$

Plugging this lower bound into the definition of $\beta$-smooth sensitivity we obtain the required
result: $S_{f,\beta}(G) \ge e^{-\beta d(G, G_2)} LS_f(G_2) \ge e^{-2\beta}(n - 2)$. □
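The construction in this proof is easy to check by brute force on small graphs. The sketch below builds $G_1$, $G_2$, $G_3$ from an arbitrary starting graph and returns $f(G_3) - f(G_2)$ for the triangle-counting query; the helper names and the representation of graphs as sets of sorted pairs over vertices $0, \dots, n-1$ are illustrative assumptions.

```python
from itertools import combinations


def count_triangles(n, edges):
    """Number of vertex triples whose three pairs are all edges."""
    return sum(
        1 for a, b, c in combinations(range(n), 3)
        if {(a, b), (a, c), (b, c)} <= edges
    )


def saturate(n, edges, v):
    """Add every possible edge incident to v."""
    return set(edges) | {(min(u, v), max(u, v)) for u in range(n) if u != v}


def isolate(edges, v):
    """Delete every edge incident to v."""
    return {e for e in edges if v not in e}


def sensitivity_witness(n, edges, v1, v2):
    """Return f(G3) - f(G2) for the graphs built in the proof of Claim 6.20."""
    g1 = saturate(n, edges, v1)
    g2 = isolate(g1, v2)
    g3 = saturate(n, g2, v2)
    return count_triangles(n, g3) - count_triangles(n, g2)
```

In $G_2$ the vertex $v_1$ is adjacent to every vertex other than $v_2$, so reconnecting $v_2$ creates a new triangle $\{v_1, v_2, w\}$ for each of the remaining $n - 2$ vertices $w$.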

By  contrast  Claim  6.21  bounds  the  restricted  sensitivity  of  subgraph  counting  queries. 

Claim 6.21. Let $f = \langle H, \bar{p} \rangle$ be a subgraph counting query and let $t = |H|$. Then $RS_f(\mathcal{H}_k) \le
t k^{t-1}$ in the edge adjacency model and in the vertex adjacency model.
Proof sketch. Let $G_1, G_2 \in \mathcal{H}_k$ be neighbors and let $v$ be a vertex such that $G_1 - v =
G_2 - v$, and let $N_i$ denote the neighbors of $v$ in $G_i$. Any copy of $H$ which occurs in $G_1$
but not in $G_2$ must contain $v$. Because $H$ is connected we can bound the number of such
copies of $H$ in $G_1$. We start with $v$, and we pick one of the $t$ vertices of $H$ to be mapped
to $v$. Denote this vertex as $v_0$. Now, we proceed inductively. We pick a vertex $v_i \in H$
connected to the set $\{v_0, v_1, \dots, v_{i-1}\}$. The vertex $v_i$ must be assigned to a vertex in $G$
which is adjacent to some specific vertex of the $i$ vertices that we already mapped. Because
we have bounded degree, there are at most $k$ options from which to choose $v_i$. We
obtain the bound: $f(G_1) - f(G_2) \le t \prod_{i=1}^{t-1} k = t k^{t-1}$. □

While the assumption $G \in \mathcal{H}_k$ may shrink the range of a subgraph counting query $f$,
the restricted sensitivity of $f$ will typically be much smaller than this reduced range. For
example, if $f(G)$ counts the number of triangles in $G$ then $f(G) \le nk^2$ for any $G \in \mathcal{H}_k$,
while $RS_f(\mathcal{H}_k) \le 3k^2 \ll nk^2$.


6.6  Future  Questions/Directions 

Efficient Mappings: While we can show that there doesn't exist an efficiently computable
$O(1)$-smooth projection $\mu : \mathcal{G} \to \mathcal{H}_k$, we don't know whether the construction of
Claim 6.15 can be improved. Meaning, there could be a mapping $\mu : \mathcal{G} \to \bar{\mathcal{H}}$ for some
$\bar{\mathcal{H}} \supset \mathcal{H}_k$, where the solution itself, the set of vertices that dominate the removed edges,
is smooth. In other words, is there an efficiently computable mapping $\mu : \mathcal{G} \to \bar{\mathcal{H}} \supset \mathcal{H}_k$
which satisfies $|d(\mu(G_1), G_1) - d(\mu(G_2), G_2)| \le c$ for every pair of neighboring graphs $G_1, G_2$ and some constant $c$?


Multiple Queries: We primarily focus on improving the accuracy of a single query $f$.
Could the notion of restricted sensitivity be used in conjunction with other mechanisms
(e.g., exponential mechanism [43], Private Multiplicative Weights mechanism [80], etc.)
to accurately answer an entire class of queries? Chapter 5 gives a mechanism that answers
multiple cut-queries in the edge-adjacency model, but those are queries whose global sensitivity
is 1. It would be interesting to consider answering multiple cut-queries for the
vertex-adjacency model, where the query has local sensitivity of $n$.

Alternate Hypotheses: We focused on the specific hypothesis $\mathcal{H}_k$. What other natural
hypothesis could be used to restrict sensitivity in private data analysis? Given such a
hypothesis $\mathcal{H}$ can we efficiently construct a query $f_{\mathcal{H}}$ with low global sensitivity or with
low smooth sensitivity over datasets $D \in \mathcal{H}$?
Part III


Clustering

Chapter 7

Stability yields a PTAS for $k$-Median
and $k$-Means Clustering

7.1  Introduction 


In this chapter we study the two popular clustering objectives of $k$-median and $k$-means.
Both problems were surveyed in the introduction (see Chapter 4). For the problem of $k$-means,
the work of Ostrovsky et al [121] gives a new approximation based on separability
of the clustering instance (see 4.2.2). Ostrovsky et al showed that for a $k$-means instance
for which the optimal $k$-clustering has cost significantly smaller than the cost of any
$(k - 1)$-clustering, one can get a very good poly-time (in both $n$ and $k$) approximation of
the $k$-means objective. Specifically, if the ratio of the cost of the optimal $(k - 1)$-clustering to
the cost of the optimal $k$-clustering is at least $\max\{100, 1/\alpha\}$, then one can find a $k$-clustering whose
cost is $(1 + O(\alpha))\,\mathrm{OPT}$. Here, we substantially improve on this approximation guarantee.
We show that under the much weaker assumption that the ratio of these costs is just at least
$(1 + \alpha)$ for some constant $\alpha > 0$, we can achieve a PTAS: namely, $(1 + \epsilon)$-approximate the
$k$-means optimum, for any constant $\epsilon > 0$. Our approximation scheme runs in time which
is $\mathrm{poly}(n, k)$ and exponential only in $1/\epsilon$ and $1/\alpha$. Thus, we decouple the strength of the
assumption from the quality of the conclusion, and in the process allow the assumption
to be substantially weaker. For $k$-means clustering we in addition give a randomized
algorithm with improved running time $n^{O(1)} (k \log n)^{\mathrm{poly}(1/\epsilon, 1/\alpha)}$.

We also reiterate the results of Balcan, Blum and Gupta [25] from Chapter 4.2.1. Balcan
et al, motivated by the fact that objective functions are often just a proxy for the
underlying goal of getting the data clustered correctly, propose clustering instances that
satisfy the condition that all $(1 + \alpha)$ approximations to the given objective (e.g., $k$-median
or $k$-means) are $\delta$-close, in terms of how points are partitioned, to a target clustering (such
as a correct clustering of proteins by function or a correct clustering of images by who is
in them). Balcan et al prove that for any $\alpha$ and $\delta$, given an instance satisfying this property
for $k$-median or $k$-means objectives, one can in fact efficiently produce a clustering that is
$O(\delta/\alpha)$-close to the target clustering (so, $O(\delta)$-close for any constant $\alpha > 0$), even though
obtaining a $1 + \alpha$ approximation to the objective is NP-hard for sufficiently small constant $\alpha$. Thus they show
that one can approximate the target even though it is hard to approximate the objective.
One interesting question that has remained is the approximability of the objectives when
all target clusters are large compared to $\delta n$, since the hardness of approximation requires
allowing small clusters.

Here, we show that for both $k$-median and $k$-means objectives, if all clusters contain
more than $\delta n$ points, then for any constant $\alpha > 0$ we can in fact get a PTAS. Thus, we
(nearly) resolve the approximability of these objectives under this condition. Note that
under this condition, this further implies finding a $\delta$-close clustering (setting $\epsilon = \alpha$).
Thus, we also extend the results of Balcan et al [25] in the case of large clusters and
constant $\alpha$ by getting exactly $\delta$-close for both $k$-median and $k$-means objectives. (In [25]
this exact closeness was achieved for the $k$-median objective but needed a somewhat larger
$O(\delta n(1 + 1/\alpha))$ minimum cluster size requirement.)

Our algorithmic results are achieved by examining implications of a property we call
weak deletion-stability that is implied by both the separation condition of Ostrovsky et
al [121] as well as (when target clusters are large) the stability condition of Balcan et
al [25]. In particular, an instance of $k$-median/$k$-means clustering satisfies weak deletion-stability
if in the optimal solution, deleting any of the centers $c^*_i$ and assigning all points
in cluster $i$ instead to one of the remaining $k - 1$ centers $c^*_j$, results in an increase in the
$k$-median/$k$-means cost by an (arbitrarily small) constant factor. We also show that weak
deletion-stability still allows for NP-hard instances and that no FPTAS is possible as well
(unless P = NP). Thus, our algorithm, whose running time is $(nk)^{\mathrm{poly}(1/\epsilon, 1/\beta)}$, is optimal
in the sense that the super-polynomial dependence on $1/\epsilon$ and $1/\beta$ is unavoidable.


Organization. After presenting notation and preliminaries, we introduce weak deletion-stability
and relate it to the stability notions of [121] and [25] in Section 7.2. We then
define another property of a clustering, being $\beta$-distributed, which, while not so intuitive,
we show is implied by weak deletion-stability and will be the actual condition that our
algorithms will use. We then go on to prove that being $\beta$-distributed suffices to give a
PTAS for $k$-median in Section 7.3. We extend the algorithm to $k$-means clustering in
Section 7.4, where we also introduce a randomized version whose run-time is bounded
by $n^3 (\log(n) \cdot k)^{\mathrm{poly}(1/\epsilon, 1/\beta)}$. We conclude with discussion and open problems in Section 7.5.

Notation. Throughout the chapter, we assume that we are given a set $S$ of $n$ points,
which we partition into $k$ disjoint subsets. When discussing $k$-median, we assume the
$n$ points reside in a finite metric space, and when discussing $k$-means, we assume they
all reside in a finite dimensional Euclidean space. We also use the abbreviation $\mu_{C_i} =
\frac{1}{|C_i|} \sum_{x \in C_i} x$ for the center of mass of the points in cluster $C_i$. We use OPT to denote the
cost of the optimal clustering $\mathcal{C}^*$. We use $\mathrm{OPT}_i$ to denote the contribution of the cluster $i$
to OPT, that is $\mathrm{OPT}_i = \sum_{x \in C^*_i} d(x, c^*_i)$ in the $k$-median case, or $\mathrm{OPT}_i = \sum_{x \in C^*_i} d^2(x, c^*_i)$
in the $k$-means case.


7.2  Stability  Properties 

As mentioned above, our results are achieved by exploiting implications of a stability
condition we call weak deletion-stability, and in particular an implication we call being
β-distributed. In this section we define weak deletion-stability and the property of being
β-distributed, relate weak deletion-stability to the conditions of Ostrovsky et al. [121] and
Balcan et al. [25], and show that weak deletion-stability implies the clustering is β-distributed.
In Sections 7.3 and 7.4 we use the property of being β-distributed to obtain a PTAS.¹

Definition 7.1. For α > 0, a k-median/k-means instance satisfies (1 + α) weak
deletion-stability, if it has the following property. Let $\{c_1^*, c_2^*, \ldots, c_k^*\}$ denote the centers in the
optimal k-median/k-means solution. Let OPT denote the optimal k-median/k-means cost
and let $\mathrm{OPT}^{(i \to j)}$ denote the cost of the clustering obtained by removing $c_i^*$ as a center and
assigning all its points instead to $c_j^*$. Then for any $i \neq j$, it holds that

$\mathrm{OPT}^{(i \to j)} > (1 + \alpha)\,\mathrm{OPT}$
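
As a concrete reading of Definition 7.1, the following minimal Python sketch computes OPT and every deletion cost $\mathrm{OPT}^{(i \to j)}$ for the k-median objective and then tests the stability inequality. It assumes the supplied centers are the optimal ones (the code does not verify optimality); the names `deletion_costs` and `is_weakly_deletion_stable` are illustrative and do not come from the thesis.

```python
from itertools import permutations


def deletion_costs(points, centers, dist):
    """Return (OPT, {(i, j): OPT^(i->j)}) for the clustering induced by `centers`.

    Assumption: `centers` are the optimal k-median centers; this is not checked.
    """
    # Assign every point to its nearest center (ties go to the lowest index).
    assign = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
              for p in points]
    opt = sum(dist(p, centers[a]) for p, a in zip(points, assign))
    costs = {}
    for i, j in permutations(range(len(centers)), 2):
        # Points of cluster i now pay their distance to c_j; the rest are unchanged.
        costs[(i, j)] = sum(dist(p, centers[j] if a == i else centers[a])
                            for p, a in zip(points, assign))
    return opt, costs


def is_weakly_deletion_stable(points, centers, dist, alpha):
    """Check OPT^(i->j) > (1 + alpha) * OPT for every ordered pair i != j."""
    opt, costs = deletion_costs(points, centers, dist)
    return all(c > (1 + alpha) * opt for c in costs.values())
```

For k-means the same check applies with `dist` replaced by the squared Euclidean distance.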

We use weak deletion-stability via the following implication we call being β-distributed.

Definition 7.2. For β > 0, a k-median instance is β-distributed if for any center $c_i^*$ of the
optimal clustering and any data point $x \notin C_i^*$, it holds that

$d(x, c_i^*) \geq \beta \cdot \frac{\mathrm{OPT}}{|C_i^*|}$

A k-means instance is β-distributed if for any such $c_i^*$ and $x \notin C_i^*$, it holds that

$d^2(x, c_i^*) \geq \beta \cdot \frac{\mathrm{OPT}}{|C_i^*|}$

¹ Technically, we could skip the “middleman” of weak deletion-stability and just define the property of
being β-distributed as our main stability notion, but weak deletion-stability is a more intuitive condition.

We prove that (1 + α) weak deletion-stability implies the clustering is α/2-distributed
for k-median (α/4-distributed for k-means) in Theorem 7.5 below. First, however, we
relate weak deletion-stability to the conditions considered in [121] and [25].
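
To make the β-distributed condition tangible, here is a small Python sketch that returns the largest β for which a given k-median clustering satisfies Definition 7.2. It assumes the clustering and centers passed in are the optimal ones and that OPT > 0; the function name `max_beta` is hypothetical.

```python
def max_beta(clusters, centers, dist):
    """Largest beta such that d(x, c_i) >= beta * OPT / |C_i| for all x outside C_i.

    Assumptions: clusters[i] is the optimal cluster of centers[i], and OPT > 0.
    """
    opt = sum(dist(x, c) for cl, c in zip(clusters, centers) for x in cl)
    best = float("inf")
    for i, (cl_i, c_i) in enumerate(zip(clusters, centers)):
        for j, cl_j in enumerate(clusters):
            if j == i:
                continue
            for x in cl_j:
                # Rearranged condition: beta <= d(x, c_i) * |C_i| / OPT.
                best = min(best, dist(x, c_i) * len(cl_i) / opt)
    return best
```

On the line instance with clusters {0, 1} and {10, 11} and centers 0 and 10, OPT = 2 and the binding pair is the point 1 against the center 10, so the instance is β-distributed exactly for β ≤ 9.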


7.2.1  ORSS-Separability 


Ostrovsky, Rabani, Schulman and Swamy [121] define a clustering instance to be
ε-separated if the optimal k-means solution is cheaper than the optimal (k − 1)-means
solution by at least a factor ε². For a given objective (k-means or k-median) let us use
$\mathrm{OPT}_{(k-1)}$ to denote the cost of the optimal (k − 1)-clustering. Introducing a parameter
α > 0, say a clustering instance is (1 + α)-ORSS separable if

$\frac{\mathrm{OPT}_{(k-1)}}{\mathrm{OPT}} > 1 + \alpha$


If an instance satisfies (1 + α)-ORSS separability then all (k − 1)-clusterings must have
cost more than (1 + α)OPT. Each clustering obtained by deleting one optimal center is a
(k − 1)-clustering, so the instance also satisfies (1 + α) weak deletion-stability. Hence we
have the following claim:

Claim 7.3. Any (1 + α)-ORSS separable k-median/k-means instance is also (1 + α)-weakly
deletion-stable.


7.2.2  BBG-Stability 

Balcan, Blum, and Gupta [25] (see also Balcan and Braverman [28] and Balcan, Röglin,
and Teng [30]) consider a notion of stability to approximations motivated by settings
in which there exists some (unknown) target clustering $C^{target}$ we would like to produce.
Balcan et al. [25] define a clustering instance to be (1 + α, δ) approximation-stable
with respect to some objective Φ (such as k-median or k-means), if any k-partition
whose cost under Φ is at most (1 + α)OPT agrees with the target clustering on all but at
most δn data points. That is, for any (1 + α) approximation C to objective Φ, we have
$\min_{\sigma \in S_k} \sum_{l} |C_l^{target} - C_{\sigma(l)}| \leq \delta n$ (here, σ is simply a matching of the indices in the
target clustering to those in C). In general, δn may be larger than the smallest target cluster
size, and in that case approximation-stability need not imply weak deletion-stability (not
surprisingly, since [25] show that k-median and k-means remain hard to approximate).
However, when all target clusters have size greater than δn (note that δ need not be a
constant) then approximation-stability indeed also implies weak deletion-stability, allowing
us to get a PTAS (and thereby be δ-close to the target) when α > 0 is a constant.

Claim 7.4. A k-median/k-means clustering instance that satisfies (1 + α, δ)
approximation-stability, and in which all clusters in the target clustering have size greater than δn, also
satisfies (1 + α) weak deletion-stability.

Proof. Consider an instance of k-median/k-means clustering which satisfies (1 + α, δ)
approximation-stability. As before, let $\{c_1^*, c_2^*, \ldots, c_k^*\}$ be the centers in the optimal
solution and consider the clustering $C^{(i \to j)}$ obtained by no longer using $c_i^*$ as a center and
instead assigning each point from cluster i to $c_j^*$, making the ith cluster empty. The
distance of this clustering from the target is defined as
$\frac{1}{n} \min_{\sigma \in S_k} \sum_{l} |C_l^{target} - C_{\sigma(l)}^{(i \to j)}|$. Since $C^{(i \to j)}$
has only (k − 1) nonempty clusters, one of the target clusters must map to an empty
cluster under any permutation σ. Since by assumption, this target cluster has more than
δn points, the distance between $C^{target}$ and $C^{(i \to j)}$ will be greater than δ and hence by
the BBG stability condition, the k-median/k-means cost of $C^{(i \to j)}$ must be greater than
(1 + α)OPT. □

7.2.3 Weak Deletion-Stability implies β-distributed

We show now that weak deletion-stability implies the instance is β-distributed.

Theorem 7.5. Any (1 + α)-weakly deletion-stable k-median instance is α/2-distributed. Any
(1 + α)-weakly deletion-stable k-means instance is α/4-distributed.

Proof. Fix any center in the optimal k-clustering, $c_i^*$, and fix any point p that does not
belong to the $C_i^*$ cluster. Denote by $C_j^*$ the cluster that p is assigned to in the optimal
k-clustering. Therefore it must hold that $d(p, c_j^*) \leq d(p, c_i^*)$. Consider the clustering
obtained by deleting $c_i^*$ from the list of centers, and assigning each point in $C_i^*$ to $c_j^*$.
Since the instance is (1 + α)-weakly deletion-stable, this should increase the cost by at
least αOPT.

Suppose we are dealing with a k-median instance. Each point $x \in C_i^*$ originally pays
$d(x, c_i^*)$, and now, assigned to $c_j^*$, it pays $d(x, c_j^*) \leq d(x, c_i^*) + d(c_i^*, c_j^*)$. Thus, the new
cost of the points in $C_i^*$ is upper bounded by $\sum_{x \in C_i^*} d(x, c_j^*) \leq \mathrm{OPT}_i + |C_i^*|\, d(c_i^*, c_j^*)$.

As the increase in cost is lower bounded by αOPT and upper bounded by $|C_i^*|\, d(c_i^*, c_j^*)$,
we deduce that $d(c_i^*, c_j^*) > \alpha \frac{\mathrm{OPT}}{|C_i^*|}$. Observe that the triangle inequality gives that
$d(c_i^*, c_j^*) \leq d(c_i^*, p) + d(p, c_j^*) \leq 2d(c_i^*, p)$, so we have that $d(c_i^*, p) > (\alpha/2) \frac{\mathrm{OPT}}{|C_i^*|}$.

Suppose we are dealing with a Euclidean k-means instance. Again, we have created a
new clustering by assigning all points in $C_i^*$ to the center $c_j^*$. Thus, the cost of transitioning
from the optimal k-clustering to this new (k − 1)-clustering, which is at least αOPT, is
upper bounded by $\sum_{x \in C_i^*} \left( \|x - c_j^*\|^2 - \|x - c_i^*\|^2 \right)$. As $c_i^* = \mu_{C_i^*}$, it follows that this
bound is exactly $\sum_{x \in C_i^*} \|c_j^* - c_i^*\|^2 = |C_i^*|\, d^2(c_i^*, c_j^*)$, see [88] (§2, Theorem 2). It follows
that $d^2(c_i^*, c_j^*) > \alpha \frac{\mathrm{OPT}}{|C_i^*|}$. As before, $d^2(c_i^*, c_j^*) \leq (d(c_i^*, p) + d(p, c_j^*))^2 \leq 4d^2(c_i^*, p)$, so
$d^2(c_i^*, p) > (\alpha/4) \frac{\mathrm{OPT}}{|C_i^*|}$. □
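
The k-means step of the proof relies on the center-of-mass identity: moving all points of a cluster from their centroid to another center raises the cost by exactly the cluster size times the squared distance between the two centers. The following Python sketch checks this numerically on a toy cluster; the helper names are illustrative only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid(points):
    """Center of mass of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def reassignment_increase(points, new_center):
    """Cost increase when `points` pay to `new_center` instead of their centroid."""
    mu = centroid(points)
    return sum(sq_dist(x, new_center) - sq_dist(x, mu) for x in points)
```

For the three points (0, 0), (2, 0), (1, 3) the centroid is (1, 1); moving them to (4, 5) raises the cost by 3 · 25 = 75, matching $|C_i^*|\, d^2(c_i^*, c_j^*)$.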

7.2.4 NP-hardness under weak deletion-stability

Finally, we would like to point out that NP-hardness of the k-median problem is maintained
even if we restrict ourselves only to weakly deletion-stable instances. Also, the
reduction sketched below uses only integer poly-size distances, and hence rules out the
existence of an FPTAS for the problem, unless P = NP. In addition, the reduction can be
modified to show that NP-hardness is maintained under the conditions studied in [121]
and [25].

Theorem 7.6. For any constant α > 0, finding the optimal k-median clustering of
(1 + α)-weakly deletion-stable instances is NP-hard.

Proof Sketch. Fix any constant α > 0. We give a 1-1 poly-time reduction from Set-Cover
to (1 + α)-weakly deletion-stable k-median instances. Under standard notation, we
assume our input consists of n subsets of a given universe of size m, for which we seek a
k-cover. We reduce such an instance to a k-median instance over m + k(n + 2mD) points.
We start with the usual reduction of Set-Cover to an instance with m points representing
the items of the universe and n points representing all possible sets. Fix an integer D > 1
to be chosen later. If j belongs to the ith set, fix the distance d(i, j) = D, otherwise we
fix the distance d(i, j) = D + 1, and between any two set-points we fix the distance to
be 1. (The distance between any two item-points is the shortest-path distance.) However, we
augment the n set-points with an additional 2mD points, setting the distance between all of
the (n + 2mD) points to 1. Furthermore, we create k copies of the augmented set-points,
all connected only via the m item-points.

Observe that each of the k copies of our augmented set-points component contains
many points, and all points outside this copy are of distance ≥ D from it. Therefore, in
the optimal k-median solution, each center resides in one unique copy of the augmented
set-points. Now, if our Set-Cover instance has a k-cover, then we can pick the respective
centers and have an optimal solution with cost exactly k(n + 2mD − 1) + mD. Otherwise,
no k sets cover all m items, so for any k centers, some item-point must have distance
D + 1 from its center, and so the cost of any k-partition is ≥ k(n + 2mD − 1) + mD + 1.
Furthermore, the resulting instance is (1 + α) weakly deletion-stable, as using one center
from each augmented set-point results in a k-median solution of cost ≤ m(D + 1) + k(n +
2mD − 1). Hence, OPT is at most this quantity. However, in any (k − 1)-clustering, one
of the set-points must pay a high cost and hence
$\mathrm{OPT}_{(k-1)} \geq (m - 1)D + (k - 1)(n + 2mD - 1) + (n + 2mD)D$. One can choose D large enough so that this cost is at least
(1 + α)OPT. □
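
The final step of the sketch ("one can choose D large enough") can be illustrated with simple arithmetic. The Python sketch below plugs the two bounds stated in the proof into a loop and returns the smallest integer D > 1 for which the lower bound on $\mathrm{OPT}_{(k-1)}$ is at least (1 + α) times the upper bound on OPT. The function name `smallest_separating_D` is hypothetical, and the starting value D = 2 is an assumption.

```python
def smallest_separating_D(n, m, k, alpha):
    """Smallest integer D >= 2 with lower(OPT_(k-1)) >= (1 + alpha) * upper(OPT).

    Both bounds are copied from the proof sketch; D = 2 as a start is an assumption.
    """
    D = 2
    while True:
        opt_upper = m * (D + 1) + k * (n + 2 * m * D - 1)
        opt_k_minus_1_lower = ((m - 1) * D
                               + (k - 1) * (n + 2 * m * D - 1)
                               + (n + 2 * m * D) * D)
        if opt_k_minus_1_lower >= (1 + alpha) * opt_upper:
            return D
        D += 1
```

The loop terminates because the lower bound grows quadratically in D while the upper bound on OPT grows only linearly. For n = 3, m = 4, k = 2 and α = 1 it returns D = 4 (186 ≥ 2 · 88).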

7.3 A PTAS for any β-distributed k-Median Instance

We now present the algorithm for finding a (1 + ε)-approximation of the k-median optimum
for β-distributed instances. First, we comment that using a standard doubling technique,
we can assume we approximately know the value of OPT.² Our algorithm works if instead
of OPT we use a value v s.t. OPT ≤ v ≤ (1 + ε/2)OPT, but for ease of exposition, we
assume that the exact value of OPT is known.
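
One way to realize the guessing of OPT is sketched below in Python. It assumes we already hold a value `v5` with OPT ≤ v5 ≤ 5·OPT (for example from a constant-factor approximation) and enumerates candidates in geometric steps of (1 + ε/2); at least one candidate v then satisfies OPT ≤ v ≤ (1 + ε/2)OPT. The function name `opt_guesses` and the exact enumeration are illustrative assumptions, not the thesis's procedure.

```python
def opt_guesses(v5, eps):
    """Candidate values for OPT, assuming OPT <= v5 <= 5 * OPT.

    Candidates start at v5 / 5 (a lower bound on OPT) and grow by (1 + eps/2).
    """
    guesses = []
    v = v5 / 5.0
    while v < v5 * (1 + eps / 2):
        guesses.append(v)
        v *= 1 + eps / 2
    return guesses
```

The number of candidates is O(log(5)/log(1 + ε/2)), so running the main algorithm once per candidate multiplies the running time only by a factor depending on ε.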

Below, we informally describe the algorithm for a special case of β-distributed instances
in which no cluster dominates the overall cost of the optimal clustering. Specifically,
we say a cluster $C_i^*$ in the optimal k-median clustering $C^*$ (hereafter also referred
to as the target clustering) is cheap if $\mathrm{OPT}_i \leq \frac{\beta\epsilon}{32}\mathrm{OPT}$; otherwise, we say $C_i^*$ is expensive.
Note that in any event, there can be at most a constant ($\frac{32}{\beta\epsilon}$) number of expensive clusters.

Algorithm Intuition: The intuition for our algorithm and for introducing the notion of
cheap clusters is the following. Pick some cluster $C_i^*$ in the optimal k-median clustering.
Since the instance is β-distributed, any $x \notin C_i^*$ is far from $c_i^*$, namely, $d(x, c_i^*) \geq \beta\frac{\mathrm{OPT}}{|C_i^*|}$.
In contrast, the average distance of $x \in C_i^*$ from $c_i^*$ is $\frac{\mathrm{OPT}_i}{|C_i^*|}$. Thus, if we focus on a cluster
whose contribution, $\mathrm{OPT}_i$, is no more than, say, $\frac{\beta}{100}\mathrm{OPT}$, we have that $c_i^*$ is 100 times
closer, on average, to the points of $C_i^*$ than to the points outside $C_i^*$. Furthermore, using
the triangle inequality we have that any two “average” points of $C_i^*$ are of distance at most
$\frac{\beta}{50}\frac{\mathrm{OPT}}{|C_i^*|}$, while the distance between any such “average” point and any point outside of $C_i^*$
is at least $\frac{99\beta}{100}\frac{\mathrm{OPT}}{|C_i^*|}$. So, if we manage to correctly guess the size s of a cheap cluster, we can
set a radius $r = \Theta\left(\beta\frac{\mathrm{OPT}}{s}\right)$ and collect data-points according to the size and intersection
of the r-balls around them. We note that this use of balls with an inverse relation between
size and radius is similar to that in the min-sum clustering algorithm of [28].

² Instead of doubling from 1, we can alternatively run an off-the-shelf 5-approximation of OPT, which
will return a value v ≤ 5OPT.

Note that in the general case we might have up to $\frac{32}{\beta\epsilon}$ expensive clusters. We handle
them by brute-force guessing their centers. In Subsection 7.3.1, we present the algorithm
for clustering β-distributed instances of k-median under the assumption that for all the
expensive clusters we have made the correct guess for their cluster centers. The algorithm
populates a list Q, where each element in this list is a subset of points. Ideally, each subset
is contained in some target cluster, yet we might have a few subsets with points from two
or more target clusters. The first stage of the algorithm is to add components into Q, and
the second stage is to find k good components in Q, and use these k components to retrieve
a clustering with low cost.

Since we do not have many expensive clusters, we can run the algorithm for all possible
guesses for the centers of the expensive clusters and choose the solution which has the
minimum cost. The analysis below shows that one such guess will lead to a solution of
cost at most (1 + ε)OPT. Later, in Section 7.4, when we deal with k-means in Euclidean
space, we use sampling techniques, similar to those of Kumar et al. [105] and Ostrovsky
et al. [121], to get good substitutes for the centers of the expensive clusters. Note, however,
an important difference between the approach of [105, 121] and ours. While they sample
points from all k clusters, we sample points only for the O(1) expensive clusters. As a
result, the runtime of the PTAS of [105, 121] has exponential dependence on k, while ours
has only a polynomial dependence on k.

7.3.1 Clustering β-distributed Instances

The algorithm is presented in Figure 7.1. In this section we assume that at the beginning,
the list Q is initialized with $Q_{init}$, which contains the centers of all the expensive clusters.
In general, the algorithm will be run several times with $Q_{init}$ containing different guesses
for the centers of the expensive clusters. Before going into the proof of correctness of
the algorithm, we introduce another definition. We define the inner ring of $C_i^*$ as the set
$\left\{x : d(x, c_i^*) \leq \frac{\beta\,\mathrm{OPT}}{8|C_i^*|}\right\}$. Note the following fact:

Fact 7.7. If $C_i^*$ is a cheap cluster, then no more than an ε/4 fraction of its points reside
outside the inner ring. In particular, at least half of a cheap cluster is contained within the
inner ring.

Proof. This follows from Markov’s inequality. If more than $(\epsilon/4)|C_i^*|$ points are outside
of the inner ring, then $\mathrm{OPT}_i > \frac{\epsilon}{4}|C_i^*| \cdot \frac{\beta\,\mathrm{OPT}}{8|C_i^*|} = \beta\epsilon\,\mathrm{OPT}/32$. This contradicts the fact that
$C_i^*$ is cheap. □

1. Initialization Stage: Set $Q \leftarrow Q_{init}$.

2. Population Stage: For s = n, n − 1, n − 2, ..., 1 do:

(a) Set $r = \frac{\beta\,\mathrm{OPT}}{4s}$.

(b) Remove any point x such that d(x, Q) < 2r.
(Here, $d(x, Q) = \min_{T \in Q;\, y \in T} d(x, y)$.)

(c) For any remaining data point x, denote the set of data points whose distance
from x is at most r, by B(x, r). Connect any two remaining points a and b if:
(i) d(a, b) ≤ r, (ii) |B(a, r)| > s/2 and (iii) |B(b, r)| > s/2.

(d) Let T be a connected component of size > s/2. Then:

i. Add T to Q. (That is, $Q \leftarrow Q \cup \{T\}$.)

ii. Define the set B(T) = {x : d(x, y) ≤ 2r for some y ∈ T}. Remove the
points of B(T) from the instance.

3. Centers-Retrieving Stage: For any choice of k components $T_1, T_2, \ldots, T_k$ out of
Q (we later show that |Q| < k + O(1/β))

(a) Find the best center $c_i$ for $T_i \cup B(T_i)$. That is,
$c_i = \arg\min_{p \in T_i \cup B(T_i)} \sum_{x \in T_i \cup B(T_i)} d(x, p)$.ᵃ

(b) Partition all n points according to the nearest point among the k centers of the
current k components.

(c) If a clustering of cost at most (1 + ε)OPT is found, output these k centers and
halt.

ᵃ This can be done before fixing the choice of k components out of Q.

Figure 7.1: A PTAS for β-distributed instances of k-median.
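
The Population Stage of Figure 7.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: points are the indices 0..n−1, `dist` is a user-supplied metric on indices, balls B(x, r) are counted over the points still remaining in the instance, and within one stage components are extracted one at a time (recomputing after each removal). The names `population_stage` and `_large_component` are hypothetical.

```python
def _large_component(alive, dist, r, s):
    """Return one connected component of size > s/2 among heavy points, or None."""
    # A point is "heavy" if its r-ball holds more than s/2 remaining points.
    heavy = [x for x in sorted(alive)
             if sum(dist(x, y) <= r for y in alive) > s / 2]
    seen = set()
    for start in heavy:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:                      # depth-first search over edges d <= r
            a = stack.pop()
            for b in heavy:
                if b not in comp and dist(a, b) <= r:
                    comp.add(b)
                    stack.append(b)
        seen |= comp
        if len(comp) > s / 2:
            return comp
    return None


def population_stage(n, dist, beta, opt, q_init=()):
    """Return Q as a list of (T, B(T) minus T) pairs of index sets."""
    Q = [({c}, set()) for c in q_init]    # guessed centers of expensive clusters
    alive = set(range(n)) - set(q_init)
    for s in range(n, 0, -1):
        r = beta * opt / (4 * s)                                   # step (a)
        placed = [y for T, _ in Q for y in T]
        alive = {x for x in alive
                 if all(dist(x, y) >= 2 * r for y in placed)}      # step (b)
        while True:
            comp = _large_component(alive, dist, r, s)             # steps (c), (d)
            if comp is None:
                break
            ball = {x for x in alive - comp
                    if any(dist(x, y) <= 2 * r for y in comp)}     # step (d)ii
            Q.append((comp, ball))
            alive -= comp | ball
    return Q
```

On six points at coordinates 0, 1, 2, 100, 101, 102 with β = 8 and OPT = 4, nothing is added at stages s = 6 and s = 5, and at stage s = 4 (r = 2) the two triples are added as two separate components.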

Our high level goal is to show that for any cheap cluster $C_i^*$ in the target clustering, we
insert a component $T_i$ that is contained within $C_i^*$, and furthermore, contains only points
that are close to $c_i^*$. It will follow from the next claims that the component $T_i$ is the one
that contains points from the inner ring of $C_i^*$. We start with the following lemma, which
we will utilize a few times.

Lemma  7.8.  Let  T  be  any  component  added  to  Q.  Let  s  be  the  stage  in  which  we  add  T 
to  Q.  Let  C*  be  any  cheap  cluster  s.t.  s  >  \C*  .  Then  (a)  T  does  not  contain  any  point  z 
s.t.  the  distance  d,(c*,z)  lies  within  the  range  /  °PT  ,  and  (b)  T  cannot  contain 

both  a  point  pi  s.t.  d(c*,pi)  <  f  arid  a  point  p2  s.t.  d(c*,p2)  >  ^y^/y. 

Proof.  We  prove  (a)  by  contradiction.  Assume  T  contains  a  point  z  s.t.  f  ^y  <  d(c*,z)  < 
°PT.  Set  r  =  72LI  <  ^^y,  just  as  in  the  stage  when  T  was  added  to  Q,  and  let  p  be 
any  point  in  the  ball  B(z,r).  Then  by  the  triangle  inequality  we  have  that  d(c*.p)  > 
d(c*,z)  -  d(z,p)  >  | ^py,  and  similarly  d(c*,p )  <  d(c* ,  z)  +  d(z,p)  <  Since  our 

instance is β-distributed it holds that p belongs to $C_i^*$, and from the definition of the inner
ring of $C_i^*$, it holds that p falls outside the inner ring. However, z is added to T because
the ball B(z, r) contains more than s/2 ≥ $|C_i^*|/2$ many points. So more than half of the
points in $C_i^*$ fall outside the inner ring of $C_i^*$, which contradicts Fact 7.7.

Assume now (b) does not hold. Recall that T is a connected component, so there exists some
path $p_1 \to p_2$. Each two consecutive points along this path were connected because their
distance  is  at  most  '°lpr  <  if^pry.  As  d(c*,pi)  <  f^yiy  and  d{c*,p2)  >  p^^jy,  there  must 

exist  a  point  z  along  the  path  whose  distance  from  c*  falls  in  the  range  f  ^yty,  ^fyyyry  > 
contradicting  (a).  □ 

Claim 7.9. Let $C_i^*$ be any cheap cluster in the target clustering. By stage $s = |C_i^*|$, the
algorithm adds to Q a component T that contains a point from the inner ring of $C_i^*$.

Proof. Suppose that up to the stage $s = |C_i^*|$ the algorithm has not inserted such a
component into Q. Now, it is possible that by stage s, the algorithm has inserted some
component T' to Q, s.t. some x in the inner ring of $C_i^*$ is too close to some y ∈ T' (namely,
d(x, y) ≤ 2r), thus causing x to be removed from the instance. Assume for now this is
not the case. This means that the inner ring of cluster $C_i^*$ still contains more than $|C_i^*|/2$
points. Also observe that all inner ring points are of distance at most $\frac{\beta\,\mathrm{OPT}}{8|C_i^*|}$ from the center,
so every pair of inner ring points has a distance of at most $\frac{\beta\,\mathrm{OPT}}{4|C_i^*|}$. Hence, when we reach
stage $s = |C_i^*|$, any ball of radius $r = \frac{\beta\,\mathrm{OPT}}{4s} = \frac{\beta\,\mathrm{OPT}}{4|C_i^*|}$, centered at any inner-ring point,
must contain all other inner-ring points. This means that at stage $s = |C_i^*|$ all inner ring
points are connected among themselves, so they form a component (in fact, a clique) of
size > s/2. Therefore, the algorithm inserts a new component, containing all inner ring
points.

So, by stage $s = |C_i^*|$, one of two things can happen. Either the algorithm inserts a
component that contains some inner ring point to Q, or the algorithm removes an inner
ring point due to some component T' ∈ Q. If the former happens, we are done. So let us
prove by contradiction that we cannot have only the latter.

Let $s \geq |C_i^*|$ be the stage in which we throw away the first inner ring point of the
cluster $C_i^*$. At stage s the algorithm removes this inner ring point x because there exists
a point y in some component T' ∈ Q, s.t. $d(x, y) \leq 2r = \frac{\beta\,\mathrm{OPT}}{2s}$, and so
$d(c_i^*, y) \leq d(c_i^*, x) + d(x, y) \leq \frac{\beta\,\mathrm{OPT}}{8|C_i^*|} + \frac{\beta\,\mathrm{OPT}}{2s} \leq \frac{5\beta\,\mathrm{OPT}}{8|C_i^*|}$. This immediately implies that T' cannot be
the center of an expensive cluster since any such point will be at a distance at least $\beta\frac{\mathrm{OPT}}{|C_i^*|}$
from $c_i^*$. Let $s' \geq s \geq |C_i^*|$ be the previous stage in which we added the component T' to
Q. As Lemma 7.8 applies to T', we deduce that $d(c_i^*, y)$ lies below the range excluded by
part (a) of the lemma. Recall that T' contains
> s'/2 ≥ $|C_i^*|/2$ many points, yet, by assumption, contains none of the $|C_i^*|/2$ points that
reside in the inner ring of $C_i^*$. It follows from Fact 7.7 that some point w ∈ T' must belong
to a different cluster $C_j^*$. Since the instance is β-distributed, we have that $d(c_i^*, w) \geq \beta\frac{\mathrm{OPT}}{|C_i^*|}$.
The existence of both y and w in T' contradicts part (b) of Lemma 7.8. □


We call a component T ∈ Q good if it contains an inner ring point of some cheap
cluster $C_i^*$. A component is called bad if it is not good and is not one of the initial centers
present in $Q_{init}$. We now discuss the properties of good components.

Claim  7.10.  Let  T  be  a  good  component  added  to  Q,  containing  an  inner  ring  point  from 
a  cheap  cluster  C* .  (By  Claim  7.9  we  know  at  least  one  such  T  exists.)  Then:  (a)  all  points 
in  T  are  of  distance  at  most  | from  c*,  (b)  T  U  B(T)  is  fully  contained  in  C*,  and  (c) 
the entire inner ring of $C_i^*$ is contained in T ∪ B(T), and (d) no other component T' ∈ Q,
T' ≠ T, contains an inner ring point from $C_i^*$.

Proof. As we do not know (d) in advance, it might be the case that Q contains many good
components, all containing an inner-ring point from the same cluster, $C_i^*$. Out of these
(potentially many) components, let T denote the first one inserted to Q. Denote the stage
in which T was inserted to Q as s. Due to the previous claim, we know $s \geq |C_i^*|$, and so
Lemma 7.8 applies to T. We show (a), (b), (c) and (d) hold for T, and deduce that T is the
only good component to contain an inner ring point from $C_i^*$.

Part (a) follows immediately from Lemma 7.8. We know T contains some inner ring
point x from $C_i^*$, so $d(c_i^*, x) \leq \frac{\beta\,\mathrm{OPT}}{8|C_i^*|}$, which lies below the range excluded by part (a) of
the lemma. By part (b) of the lemma, every y ∈ T must then also lie below that range, which
is the bound in (a). Since we now know (a) holds and the instance is β-distributed, we
have that $T \subseteq C_i^*$, so we only need to show $B(T) \subseteq C_i^*$. Fix any y ∈ B(T). The point
y is assigned to B(T) (thus removed from the instance) because there exists some point
x ∈ T s.t. d(x, y) ≤ 2r. So again, we have that $d(c_i^*, y) \leq d(c_i^*, x) + d(x, y) < \beta\frac{\mathrm{OPT}}{|C_i^*|}$,
which gives us that $y \in C_i^*$ (since the instance is β-distributed).

We now prove (c). Because of (b), we deduce that the number of points in T is at most
$|C_i^*|$. However, in order for T to be added to Q, it must also hold that |T| > s/2. It follows
that $s < 2|C_i^*|$. Let x be an inner ring point of $C_i^*$ that belongs to T. Then the distance of
any other inner ring point of $C_i^*$ and x is at most $\frac{\beta\,\mathrm{OPT}}{4|C_i^*|} < \frac{\beta\,\mathrm{OPT}}{2s} = 2r$. It follows that any
inner ring point of $C_i^*$ which isn’t added to T is assigned to B(T). Thus T ∪ B(T) contains
all inner-ring points. Finally, observe that (d) follows immediately from the definition of
a good component and from (c). □


We  now  show  that  in  addition  to  having  all  k  good  components,  we  cannot  have  too 
many  bad  components. 


Claim 7.11. We have less than 16/(3β) bad components.


Proof. Let T be a bad component, and let s be the stage in which T was inserted to Q. Let
y be any point in T, and let $C_j^*$ be the cluster to which y belongs in the optimal clustering
with center $c_j^*$. We show $d(c_j^*, y) > \frac{3\beta\,\mathrm{OPT}}{8s}$, and divide into cases.

Case 1: $C_j^*$ is an expensive cluster. Note that we are working under the assumption that
$Q_{init}$ contains the correct centers of the expensive clusters. In particular, $Q_{init}$ contains
$c_j^*$. Also, the fact that point y was not thrown out in stage s implies that
$d(c_j^*, y) \geq 2r = \frac{\beta\,\mathrm{OPT}}{2s} > \frac{3\beta\,\mathrm{OPT}}{8s}$.

Case 2: $C_j^*$ is a cheap cluster and $s \geq |C_j^*|$. We apply Lemma 7.8, and deduce that
$d(c_j^*, y)$ lies either below or above the range excluded in part (a) of the lemma, and every
distance above that range exceeds $\frac{3\beta\,\mathrm{OPT}}{8s}$. As the inner ring of $C_j^*$
contains > $|C_j^*|/2$ points and T contains > s/2 ≥ $|C_j^*|/2$ many points, none of which is an inner
ring point, some point w ∈ T does not belong to $C_j^*$ and hence $d(c_j^*, w) \geq \beta\frac{\mathrm{OPT}}{|C_j^*|} > \frac{3\beta\,\mathrm{OPT}}{8s}$.
Part (b) of Lemma 7.8 assures us that all points in T are also far from $c_j^*$.

Case 3: $C_j^*$ is a cheap cluster and $s < |C_j^*|$. Using Claim 7.9 we have that some good
component containing a point x from the inner ring of $C_j^*$ was already added to Q. So it
must hold that d(x, y) ≥ 2r, for otherwise we removed y from the instance and it cannot
be added to any T. We deduce that
$d(c_j^*, y) \geq d(x, y) - d(c_j^*, x) \geq \frac{\beta\,\mathrm{OPT}}{2s} - \frac{\beta\,\mathrm{OPT}}{8|C_j^*|} > \frac{3\beta\,\mathrm{OPT}}{8s}$.

All points in T have distance > $\frac{3\beta\,\mathrm{OPT}}{8s}$ from their respective centers in the optimal
clustering, and recall that T is added to Q because T contains at least s/2 many points.
Therefore, the contribution of all elements in T to OPT is at least $\frac{3\beta\,\mathrm{OPT}}{16}$. It follows that
we can have no more than 16/(3β) such bad components. □

We  can  now  prove  the  correctness  of  our  algorithm. 

Theorem 7.12. The algorithm outputs a k-clustering whose cost is no more than (1 + ε)OPT.

Proof. Using Claim 7.10, it follows that there exists some choice of k components, $T_1, \ldots, T_k$,
such that we have the center of every expensive cluster and the good component
corresponding to every cheap cluster $C_i^*$. Fix that choice. We show that for the optimal
clustering, replacing the true centers $\{c_1^*, c_2^*, \ldots, c_k^*\}$ with the centers $\{c_1, c_2, \ldots, c_k\}$ that the
algorithm outputs, increases the cost by at most a (1 + ε) factor. This implies that using
$\{c_1, c_2, \ldots, c_k\}$ as centers must result in a clustering with cost at most (1 + ε)OPT.

Fix any $C_i^*$ in the optimal clustering. Let $\mathrm{OPT}_i$ be the cost of this cluster. If $C_i^*$ is an
expensive cluster then we know that its center $c_i^*$ is present in the list of centers chosen.
Hence, the cost paid by points in $C_i^*$ will be at most $\mathrm{OPT}_i$. If $C_i^*$ is a cheap cluster then
denote by T the good component corresponding to it. We break the cost of $C_i^*$ into two
parts:

$\mathrm{OPT}_i = \sum_{x \in C_i^*} d(x, c_i^*) = \sum_{x \in T \cup B(T)} d(x, c_i^*) + \sum_{x \in C_i^*,\; x \notin T \cup B(T)} d(x, c_i^*)$

and compare it to the cost of $C_i^*$ using $c_i$, the point picked by the algorithm to serve as center:

$\sum_{x \in C_i^*} d(x, c_i) = \sum_{x \in T \cup B(T)} d(x, c_i) + \sum_{x \in C_i^*,\; x \notin T \cup B(T)} d(x, c_i)$

Now, the first term is exactly the function that is minimized by $c_i$, as
$c_i = \arg\min_p \sum_{x \in T \cup B(T)} d(x, p)$. We also
know $c_i^*$, the actual center of $C_i^*$, resides in the inner ring, and therefore, by Claim 7.10
must belong to T ∪ B(T). It follows that $\sum_{x \in T \cup B(T)} d(x, c_i) \leq \sum_{x \in T \cup B(T)} d(x, c_i^*)$.
We now upper bound the 2nd term, and show that
$\sum_{x \in C_i^*,\; x \notin T \cup B(T)} d(x, c_i) \leq (1 + \epsilon) \sum_{x \in C_i^*,\; x \notin T \cup B(T)} d(x, c_i^*)$.

Any point $x \in C_i^*$, s.t. $x \notin T \cup B(T)$, must reside outside the inner ring of $C_i^*$.
Therefore, $d(x, c_i^*) > \frac{\beta\,\mathrm{OPT}}{8|C_i^*|}$. We show that $d(c_i, c_i^*) \leq \epsilon\frac{\beta\,\mathrm{OPT}}{8|C_i^*|}$, and thus we have that
$d(x, c_i) \leq d(x, c_i^*) + d(c_i^*, c_i) \leq (1 + \epsilon)\,d(x, c_i^*)$, which gives the required result.

Note that thus far, we have only used the fact that the cost of any cheap cluster is
proportional to $\beta\,\mathrm{OPT}/|C_i^*|$. Here is the first (and the only) time we use the fact that the
cost is actually at most $(\epsilon/32) \cdot \beta\,\mathrm{OPT}/|C_i^*|$. Using the Markov inequality, we have that
the set of points satisfying $\{x : d(x, c_i^*) \leq \epsilon \cdot \beta\,\mathrm{OPT}/(16|C_i^*|)\}$ contains at least half of
the points in $C_i^*$, and they all reside in the inner ring, thus belong to T ∪ B(T). Assume
for the sake of contradiction that $d(c_i, c_i^*) > \epsilon\frac{\beta\,\mathrm{OPT}}{8|C_i^*|}$. Then at least half of the points in $C_i^*$
contribute more than $\epsilon\frac{\beta\,\mathrm{OPT}}{16|C_i^*|}$ to the sum $\sum_{x \in T \cup B(T)} d(x, c_i)$. It follows that this sum is more
than $\epsilon\frac{\beta\,\mathrm{OPT}}{32} \geq \mathrm{OPT}_i$. However, $c_i$ is the point that minimizes the sum $\sum_{x \in T \cup B(T)} d(x, p)$,
and by using $p = c_i^*$ we have $\sum_{x \in T \cup B(T)} d(x, p) \leq \mathrm{OPT}_i$. Contradiction.

□


7.3.2  Runtime  analysis 

A naive implementation of the 2nd step of the algorithm in Section 7.3.1 takes O(n³) time
(for every s and every point x, find how many of the remaining points fall within the ball
of radius r around it). Finding $c_i$ for all components takes O(n²) time, and measuring the
cost of the solution using a particular set of k data points as centers takes O(nk) time.
Guessing the right k components takes $k^{O(1/\beta)}$ time. Overall, the running time of the
algorithm in Figure 7.1 is $O(n^3 k^{O(1/\beta)})$. The general algorithm, which brute-force guesses
the centers of all expensive clusters, makes $n^{O(1/\beta\epsilon)}$ iterations of the given algorithm, so
its overall running time is $n^{O(1/\beta\epsilon)} k^{O(1/\beta)}$.


7.4 A PTAS for any $\beta$-distributed Euclidean k-Means Instance

7.4.1  Intuition 

Analogous to the k-median algorithm, we present an essentially identical algorithm for k-means in Euclidean space. The fact that k-means considers squared distances makes upper (or lower) bounding distances a bit more complicated, and requires that we adjust the parameters of the algorithm. In addition, the centers $c^*$ may not be data points. However, the overall approach remains the same. Roughly speaking, in converting the k-median algorithm to the k-means case, we use the same constants, only squared. As before, we handle expensive clusters by guessing good substitutes for their centers and obtain good components for cheap clusters.

Often, when considering the Euclidean space k-means problem, the dimension of the space plays an important role. In contrast, here we make no assumptions about the dimension, and our results hold for any poly(n) dimension. In fact, for ease of exposition, we assume all distances between any two points were computed in advance and are given to our algorithm. Clearly, this only adds $O(n^2 \cdot \dim)$ to our runtime. In addition to the change in parameters, we utilize the following facts that hold for the center of mass in Euclidean space.

Fact 7.13. Let $U$ be a (finite) set of points in a Euclidean space, and let $\mu_U$ denote their center of mass ($\mu_U = \frac{1}{|U|} \sum_{x \in U} x$). Let $A$ be a random subset of $U$, and denote by $\mu_A$ the center of mass of $A$. Then for any $\delta < 1/2$, we have both

Fact 7.14. Let $U$ be a (finite) set of points in a Euclidean space, and let $A \neq \emptyset$ and $B$ be a partition of $U$. Denote by $\mu_U$ and $\mu_A$ the center of mass of $U$ and $A$ resp. Then

$\|\mu_U - \mu_A\|^2 \le \frac{1}{|U|} \sum_{x \in U} \|x - \mu_U\|^2 \cdot \frac{|B|}{|A|}$.

Fact 7.14, proven in [121] (Lemma 2.2), allows us to upper bound the distance between the real center of a cluster and the empirical center we get by averaging all points in $T \cup B(T)$ for a good component T. Fact 7.13 allows us to handle expensive clusters. Since we cannot brute-force guess a center (as the centers of the clusters aren't necessarily data points), we guess a sample of $O(\beta^{-1} + \epsilon^{-1})$ points from every expensive cluster, and use their average as a center. Both properties of Fact 7.13, proven in [88] (§3, Lemmas 1 and 2), assure us that this empirical center is an adequate substitute for the real center and is also close to it. This motivates the approach behind our first algorithm, in which we brute-force traverse all choices of $O(\epsilon^{-1} + \beta^{-1})$ points for each of the expensive clusters.
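The role of an empirical center can be illustrated numerically. The sketch below is not the thesis's algorithm: it only checks the standard identity that replacing the center of mass $\mu$ of a set $U$ by any other point $c$ raises the k-means cost by exactly $|U| \cdot \|c - \mu\|^2$, which is why a sample mean that lands close to $\mu$ is an adequate substitute. The data, the sample size of 20 and the helper names are assumptions made for the illustration.

```python
import random

def center_of_mass(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(dim))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_cost(points, center):
    """Sum of squared distances from every point to a single center."""
    return sum(sq_dist(p, center) for p in points)

def sampled_center(points, sample_size, rng):
    """Mean of a uniform sample drawn with replacement (illustrative helper)."""
    return center_of_mass([rng.choice(points) for _ in range(sample_size)])

rng = random.Random(0)  # fixed seed, only so that the run is repeatable
cluster = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
mu = center_of_mass(cluster)
c = sampled_center(cluster, 20, rng)

# Extra cost paid for using c instead of the true center of mass mu.
excess = kmeans_cost(cluster, c) - kmeans_cost(cluster, mu)
# Identity: excess = |U| * ||c - mu||^2, so a center close to mu is cheap.
gap = len(cluster) * sq_dist(c, mu)
```

Comparing `excess` with `gap` shows the two quantities agree up to floating-point error.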

The second algorithm, whose runtime is $(k \log n)^{\mathrm{poly}(1/\epsilon, 1/\beta)} O(n^3)$, replaces brute-force guessing with random sampling. Indeed, if a cluster contains a poly(1/k) fraction of the points, then by randomly sampling $O(\epsilon^{-1} + \beta^{-1})$ points, the probability that all points belong to the same expensive cluster, and furthermore, their average can serve as a good empirical center, is at least $1/k^{\mathrm{poly}(1/\epsilon, 1/\beta)}$. In contrast, if we have expensive clusters that contain few points (e.g. an expensive cluster of size $\sqrt{n}$, while $k = \mathrm{poly}(\log(n))$), then random sampling is unlikely to find good empirical centers for them. However, recall that our algorithm collects points and deletes them from our instance. So it is possible that in the middle of the run we are left with so few points that expensive clusters whose size is small in comparison to the original number of points contain a poly(1/k) fraction of the remaining points.

Indeed, this is the motivation behind our second algorithm. We run the algorithm while interleaving the Population Stage of the algorithm with random sampling. Instead of running s from n to 1, we use $\{n, \frac{n}{k^2}, \frac{n}{k^4}, \ldots, 1\}$ as break points. Correspondingly, we define $l_i$ to be the number of expensive clusters whose size is in the range $[n \cdot k^{-2i-2}, n \cdot k^{-2i})$. Whenever s reaches such an $n \cdot k^{-2i}$ break point, we randomly sample points in order to guess the $l_{i+3}$ centers of the clusters that lie 3 intervals "ahead" (and so, initially, we guess all centers in the first 3 intervals). We prove that in every interval we are likely to sample good empirical centers. This is a simple corollary of Fact 7.14 along with the following two claims. First, we claim that at the end of each interval, the number of points remaining is at most $n \cdot k^{-2i+1}$. Secondly, we also claim that in each interval we do not remove even a single point from a cluster whose size is smaller than $n \cdot k^{-2i-6}$.
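The break points described above can be listed with a few lines of code. This is a minimal sketch, assuming integer rounding down (the text does not specify rounding) and $k \ge 2$; `break_points` is a name introduced here for illustration.

```python
def break_points(n, k):
    """Stages n, n/k^2, n/k^4, ... (rounded down), always ending at 1.

    Assumes n >= 1 and k >= 2; rounding down is a choice made for this sketch.
    """
    if n < 1 or k < 2:
        raise ValueError("need n >= 1 and k >= 2")
    points = []
    s = n
    while s > 1:
        points.append(s)
        s //= k * k  # move to the next break point
    points.append(1)
    return points

stages = break_points(10000, 3)
```

For `n = 10000` and `k = 3` there are five break points, in line with roughly $\frac{1}{2}\log_k n$ intervals.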

7.4.2 A Deterministic Algorithm for $\beta$-distributed k-Means Instances

Our algorithm is presented in Figure 7.2. The correctness is proved in a similar fashion to the proof of correctness presented in Section 7.3. Much like in Section 7.3, we call a cluster in the optimal k-means solution cheap if $\mathrm{OPT}_i = \sum_{x \in C^*} d^2(x, c^*) \le \frac{\beta\epsilon\mathrm{OPT}}{4^6}$. First, observe that by the Markov inequality, for any cheap cluster $C^*$, we have that the set $\{x;\ d^2(x, c^*) > t\, \frac{\beta\mathrm{OPT}}{|C^*|}\}$ cannot contain more than an $\epsilon/(4^6 t)$ fraction of the points in $C^*$. It follows that the inner ring of $C^*$, the set $\{x;\ d^2(x, c^*) \le \frac{\beta\mathrm{OPT}}{256|C^*|}\}$, contains at least half of the points of $C^*$. The algorithm populates the list Q with good components corresponding to cheap clusters. Also from Fact 7.13, we know that for every expensive cluster, there exists a sample of $O(\frac{1}{\beta} + \frac{1}{\epsilon})$ data points whose center is a good substitute for the center of the cluster. In the analysis below, we assume that Q has been initialized correctly with $Q_{init}$ containing these good substitutes. In general, the algorithm will be run multiple times for all possible guesses of samples from expensive clusters. We start with the following lemma which is similar to Lemma 7.8.
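The Markov-inequality step can be checked on a toy cluster. The sketch below assumes the thresholds as read from the text ($\beta\epsilon\mathrm{OPT}/4^6$ for cheapness and $\beta\mathrm{OPT}/(256|C^*|)$ for the inner ring); the data and the helper names `is_cheap` and `inner_ring` are made up for the illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def is_cheap(cluster, center, beta, eps, opt):
    """Cheap: the cluster's k-means cost is at most beta * eps * OPT / 4^6."""
    cost = sum(sq_dist(p, center) for p in cluster)
    return cost <= beta * eps * opt / 4 ** 6

def inner_ring(cluster, center, beta, opt):
    """Points whose squared distance to the center is <= beta * OPT / (256 |C|)."""
    radius_sq = beta * opt / (256 * len(cluster))
    return [p for p in cluster if sq_dist(p, center) <= radius_sq]

# Toy cluster: 99 points on the center and one outlier at squared distance 1.
center = (0.0,)
cluster = [(0.0,)] * 99 + [(1.0,)]
beta, eps, opt = 1.0, 0.5, 10000.0

cheap = is_cheap(cluster, center, beta, eps, opt)
ring = inner_ring(cluster, center, beta, opt)
outside_fraction = 1 - len(ring) / len(cluster)
# Markov with t = 1/256: at most eps * 256 / 4^6 = eps / 16 of the points lie outside.
```

Here the outlier falls outside the inner ring, and the fraction outside (0.01) stays below $\epsilon/16$, as the Markov bound requires for a cheap cluster.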

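Lemma 7.15. Let $T \in Q$ be any component and let s be the stage in which we insert T to Q. Let $C^*$ be any cheap cluster s.t. $s \ge |C^*|$. Then (a) T does not contain any point z s.t. the distance $d^2(c^*, z)$ lies within the range $\left[\frac{\beta\mathrm{OPT}}{16|C^*|}, \frac{\beta\mathrm{OPT}}{4|C^*|}\right)$, and (b) T cannot contain both a point $p_1$ s.t. $d^2(c^*, p_1) \le \frac{\beta\mathrm{OPT}}{16|C^*|}$ and a point $p_2$ s.t. $d^2(c^*, p_2) \ge \frac{\beta\mathrm{OPT}}{4|C^*|}$.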

Proof. Assume (a) does not hold. Let z be such a point, and let $B(z, r)$ be the set of all points p s.t. $d^2(z,p) \le r = \frac{\beta\mathrm{OPT}}{64s} \le \frac{\beta\mathrm{OPT}}{64|C^*|}$. As $d^2(z, c^*) \ge \frac{\beta\mathrm{OPT}}{16|C^*|}$, we have that $d(z,p) \le \frac{1}{2} d(z, c^*)$. It follows that $d^2(c^*, p) \ge (d(c^*,z) - d(z,p))^2 \ge (d(c^*,z)/2)^2 \ge \frac{\beta\mathrm{OPT}}{64|C^*|}$. Similarly, $d^2(c^*,p) \le (d(c^*,z) + d(z,p))^2 \le (3d(c^*,z)/2)^2 < \frac{\beta\mathrm{OPT}}{|C^*|}$. Thus $B(z,r)$ is contained in $C^*$, but falls outside the inner ring of $C^*$, yet contains $s/2 \ge |C^*|/2$ many points. Contradiction.

Assume (b) does not hold. Let $p_1$ and $p_2$ be the above mentioned points. As T is a connected component, it follows that along the path $p_1 \to p_2$ there exists a pair of neighboring nodes x, y s.t. $d^2(x,y) \le r \le \frac{\beta\mathrm{OPT}}{64|C^*|}$, yet $d^2(c^*,x) \le \frac{\beta\mathrm{OPT}}{16|C^*|}$ while $d^2(c^*,y) \ge \frac{\beta\mathrm{OPT}}{4|C^*|}$. However, a simple computation gives that $d^2(c^*,y) \le (d(c^*,x) + d(x,y))^2 \le \frac{9\beta\mathrm{OPT}}{64|C^*|}$. Contradiction. □


1. Initialization Stage: Set $Q \leftarrow Q_{init}$.

2. Population Stage: For $s = n, n-1, n-2, \ldots, 1$ do:

(a) Set $r = \frac{\beta\mathrm{OPT}}{64s}$.

(b) Remove any point x such that $d^2(x, Q) \le 4r$.
(Here, $d(x, Q) = \min_{T \in Q,\ y \in T} d(x, y)$.)

(c) For any remaining data point x, denote the set of data points whose distance squared from x is at most r, by $B(x, r)$. Connect any two remaining points a and b if:
(i) $d^2(a,b) \le r$, (ii) $|B(a,r)| > \frac{s}{2}$ and (iii) $|B(b,r)| > \frac{s}{2}$.

(d) Let T be a connected component of size $> \frac{s}{2}$. Then:

i. Add T to Q. (That is, $Q \leftarrow Q \cup \{T\}$.)

ii. Define the set $B(T) = \{x : d^2(x, y) \le 4r \text{ for some } y \in T\}$. Remove the points of $B(T)$ from the instance.

3. Centers-Retrieving Stage: For any choice of k components $T_1, T_2, \ldots, T_k$ out of Q:

(a) Find the best center $c_i$ for $T_i \cup B(T_i)$. That is, $c_i = \mu(T_i \cup B(T_i)) = \frac{1}{|T_i \cup B(T_i)|} \sum_{x \in T_i \cup B(T_i)} x$.

(b) Partition all n points according to the nearest point among the k centers of the current k components.

(c) If a clustering of cost at most $(1+\epsilon)\mathrm{OPT}$ is found - output these k centers and halt.

Figure 7.2: A deterministic PTAS for $\beta$-distributed instances of Euclidean k-means.
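The Population Stage of Figure 7.2 can be sketched in code as follows. This is an illustrative reading, not the thesis's implementation: it assumes the product $\beta \cdot \mathrm{OPT}$ is known (or guessed), takes a precomputed matrix of squared distances, represents the initial guesses as lists of data-point indices (in the text they are sample means, which need not be data points), and processes every sufficiently large component found at a stage.

```python
from collections import deque

def population_stage(d2, beta_opt, q_init=()):
    """Return the list Q of components (lists of point indices).

    d2       -- n x n matrix of squared distances
    beta_opt -- the product beta * OPT, assumed known or guessed
    q_init   -- optional initial components, given as lists of indices
    """
    n = len(d2)
    alive = set(range(n))
    Q = [list(t) for t in q_init]
    for s in range(n, 0, -1):
        r = beta_opt / (64 * s)
        # (b) remove points within squared distance 4r of a point in Q
        alive -= {x for x in alive
                  if any(d2[x][y] <= 4 * r for t in Q for y in t)}
        # (c) balls over the remaining points; edges join two "dense" points
        ball = {x: [y for y in alive if d2[x][y] <= r] for x in alive}
        dense = {x for x in alive if len(ball[x]) > s / 2}
        # (d) breadth-first search for connected components of dense points
        seen = set()
        for start in sorted(dense):
            if start in seen or start not in alive:
                continue
            comp, queue = [], deque([start])
            seen.add(start)
            while queue:
                a = queue.popleft()
                comp.append(a)
                for b in ball[a]:
                    if b in dense and b in alive and b not in seen:
                        seen.add(b)
                        queue.append(b)
            if len(comp) > s / 2:
                Q.append(sorted(comp))
                # remove B(T): everything within squared distance 4r of T
                alive -= {x for x in alive
                          if any(d2[x][y] <= 4 * r for y in comp)}
    return Q

# Toy instance: two tight groups on a line, far apart.
coords = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
d2 = [[(a - b) ** 2 for b in coords] for a in coords]
components = population_stage(d2, beta_opt=64.0)
```

With `beta_opt = 64` the radius is $r = 1/s$; at stage $s = 7$ each group of four points becomes a component of size greater than $s/2$, so `components` holds the two groups.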


Lemma 7.15 allows us to give the analogous claims to Claims 7.9 and 7.10. As before, call a component T good if it is contained within some target cluster $C^*$ and $T \cup B(T)$ contains all of the inner ring points of $C^*$. Otherwise, the component is called bad, provided it is not one of the initial centers present in $Q_{init}$. We now show that each cheap target cluster will have a single, unique, good component.

Claim 7.16. Let $C^*$ be any cheap cluster in the target clustering. By stage $s = |C^*|$, the algorithm adds to Q a component T that contains a point from the inner ring of $C^*$.

Claim 7.17. Let T be a good connected component added to Q, containing an inner ring point from cluster $C^*$. Then: (a) all points in T are of distance squared at most $\frac{\beta\mathrm{OPT}}{16|C^*|}$ from $c^*$, (b) $T \cup B(T)$ is fully contained in $C^*$, (c) the entire inner ring of $C^*$ is contained in $T \cup B(T)$, and (d) no other component $T' \neq T$ in Q contains an inner ring point from $C^*$.


As the proofs of Claims 7.16 and 7.17 are identical to those of Claims 7.9 and 7.10, we omit them.

Lemma 7.18. We do not add to Q more than $1000/\beta$ bad components.

Proof. Consider any bad component T that we add to Q and denote the stage in which we insert T to Q as s. So the size of this component is $> \frac{s}{2}$. Let y be an arbitrary point from T which belongs to cluster $C^*$ in the optimal clustering. Let $c^*$ be the center of $C^*$.
We  show  that  d2(c*,  y)  > 

We  divide  into  cases. 

Case 1: $C^*$ is a cheap cluster and $s \ge |C^*|$. Recall that T must contain $s/2 \ge |C^*|/2$ points, so it follows that T contains some point x that does not belong to $C^*$. $\beta$-stability gives that this point has distance $d^2(c^*,x) \ge \frac{\beta\mathrm{OPT}}{|C^*|}$, and we apply Lemma 7.15 to deduce that all points in T are of distance squared of at least $\frac{\beta\mathrm{OPT}}{4|C^*|}$.

Case 2: $C^*$ is a cheap cluster and $s < |C^*|$. In this case we have that the entire inner ring of $C^*$ already belongs to some $T' \in Q$. Let $x \in T'$ be any inner ring point from $C^*$, and we have that $d^2(c^*,x) \le \frac{\beta\mathrm{OPT}}{256|C^*|}$ while $d^2(x,y) > 4r = \frac{\beta\mathrm{OPT}}{16s}$. It follows that $d^2(c^*,y) \ge (3d(x,y)/4)^2 \ge \frac{9\beta\mathrm{OPT}}{256s}$.

Case  3:  C*  is  an  expensive  cluster  and  s  >  2\C*\.  We  claim  that  d2(c*,  y)  >  f^yirj. 
If,  by  contradiction,  we  have  that  d2(c*,y )  <  |^j,  then  we  show  that  the  ball  B(y,  r) 
contains  only  points  from  C*,  yet  it  must  contains  s/2  >  C*  points.  This  is  because  each 

P  e  B(y,r)  satisfies  that  d2(c*,p )  <  {d(c*,y)  +  d(y,p)f  <  < 

/3QPT 

]c*|  • 

Case  4:  C*  is  an  expensive  cluster  and  s  <  2|C*|.  In  this  case,  from  Fact  7.13  we 
know  that  Qirat  contains  a  a  good  empirical  center  c  for  the  expensive  cluster  C*,  in  the 
sense  that  ||c  —  c*\\2  <  Then,  similarly  case  2  above  we  have  d2(y,  c*)  > 

( d(y ,  c )  —  d(c,  c*))2  >  It  follows  that  every  point  in  T  has  a  large  distance  from  its 
center. Therefore, the s/2 points in this component contribute at least $\beta\mathrm{OPT}/1000$ to the k-means cost. Hence, we can have no more than $1000/\beta$ such bad components. □

We  now  prove  the  main  theorem. 

Theorem 7.19. The algorithm outputs a k-clustering whose cost is at most $(1+\epsilon)\mathrm{OPT}$.

Proof. Using Claim 7.17, it follows that there exists some choice of k components which has good components for all the cheap clusters and good substitutes for the centers of the expensive clusters. Fix that choice and consider a cluster $C^*$ with center $c^*$. If $C^*$ is an expensive cluster then from Section 7.4 we know that $Q_{init}$ contains a point $c_i$ such that
d2(ci,c*)  <  Hence,  the  cost  paid  by  the  points  in  C*  will  be  atmost  (l  +  e)OPT.j. 

If $C^*$ is a cheap cluster then denote by T the good component that resides within $C^*$. Denote $T \cup B(T)$ by A, and $C^* \setminus A$ by B. Let $c_i$ be the center of A. We know that the entire inner ring of $C^*$ is contained in A, therefore, B cannot contain more than an $\epsilon/16$ fraction of
the  points  of  C*.  Fact  7.14  dictates  that  in  this  case,  ||c*  —  c;||2  <  e2  .  We  know  every 

x  G  B  contributes  at  least  2^\c*\  to  cost  so  Hci  —  C*H2  —  —  c* ||2.  Thus,  for 

every $x \in B$, we have that $\|x - c_i\|^2 \le (1+\epsilon)\|x - c^*\|^2$. It follows that $\sum_{x \in B} \|x - c_i\|^2 \le (1+\epsilon) \sum_{x \in B} \|x - c^*\|^2$, and obviously $\sum_{x \in A} \|x - c_i\|^2 \le \sum_{x \in A} \|x - c^*\|^2$ as $c_i$ is the center of mass of A. Therefore, when choosing the good k components out of Q, we can assign them to the centers in such a way that costs no more than $(1+\epsilon)\mathrm{OPT}$. Obviously the assignment of each point to the nearest of the k centers only yields a less costly clustering, and thus its cost is also at most $(1+\epsilon)\mathrm{OPT}$. □
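The Centers-Retrieving Stage that this proof analyzes can be sketched as an exhaustive search over k-subsets of candidate sets. The sketch assumes each candidate is already given as the point set $T_i \cup B(T_i)$ and that a cost budget standing in for $(1+\epsilon)\mathrm{OPT}$ is supplied by the caller (OPT itself is not known in practice); `retrieve_centers` and the toy data are hypothetical.

```python
from itertools import combinations

def mean(points):
    """Center of mass of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(dim))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def clustering_cost(points, centers):
    """k-means cost when every point is served by its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def retrieve_centers(points, candidate_sets, k, budget):
    """Try every choice of k candidate sets; return the first list of
    centers whose cost is within the budget, or None if there is none."""
    for choice in combinations(candidate_sets, k):
        centers = [mean(group) for group in choice]
        if clustering_cost(points, centers) <= budget:
            return centers
    return None

points = [(0.0,), (2.0,), (10.0,), (12.0,)]
candidate_sets = [
    [(0.0,), (2.0,)],    # a good component for the left cluster
    [(2.0,), (10.0,)],   # a bad component straddling both clusters
    [(10.0,), (12.0,)],  # a good component for the right cluster
]
found = retrieve_centers(points, candidate_sets, k=2, budget=4.0)
```

The first pair tried includes the bad component and costs 54, so it is rejected; the pair of good components costs 4 and is returned.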
7.4.3 A Randomized Algorithm for $\beta$-distributed k-Means Instances

We now present a randomized algorithm which achieves a $(1+\epsilon)$ approximation to the k-means optimum of a $\beta$-distributed instance and runs in time $(k \log_k n)^{\mathrm{poly}(1/\epsilon, 1/\beta)} O(n^3)$. The algorithm is similar in nature to the one presented in the previous section, except that for expensive clusters we replace brute-force guessing of samples with random sampling. Note that the straightforward approach of sampling the points right at the start of the algorithm might fail, if there exist expensive clusters which contain very few points. A better approach is to interleave the sampling step with the rest of the algorithm. In this way we sample points from an expensive cluster only when it contains a reasonable fraction of the total points remaining, hence our probability of success is noticeable (namely, poly(1/k)). The alterations required in making the previous algorithm into a randomized one that also samples cluster centers are detailed in Figure 7.3.

The high-level approach of the algorithm is to partition the main loop of the Population Stage, in which we try all possible values of s (starting from n and ending at 1), into intervals. In interval i we run s on all values starting with $\frac{n}{k^{2i}}$ and ending with $\frac{n}{k^{2i+2}}$. So overall, we have no more than $t = \frac{1}{2}\log_k(n)$ intervals. Our algorithm begins by guessing l, the number of expensive clusters, then guessing $g_1, g_2, \ldots, g_t$ s.t. $\sum_i g_i = l$. Each $g_i$ is a guess for the number of expensive clusters whose size lies in the range $\left[\frac{n}{k^{2i}}, \frac{n}{k^{2i-2}}\right)$. Note that $\sum_i g_i = \#$ expensive clusters $\le \frac{4^6}{\beta\epsilon}$. Hence, there are at most $(\log_k n)^{4^6/(\beta\epsilon)}$ possible assignments to the $g_i$'s, and we run the algorithm for every such possible guess.
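The guessing step amounts to enumerating all ways of writing l as an ordered sum of t non-negative integers. The sketch below is a plain recursive enumeration; the function name `guesses` and the values `l = 3`, `t = 4` are chosen only for illustration. The number of such tuples is $\binom{l+t-1}{t-1}$, which is at most $t^l$.

```python
from math import comb

def guesses(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in guesses(total - first, parts - 1):
            yield (first,) + rest

l, t = 3, 4  # illustrative values: 3 expensive clusters, 4 intervals
all_guesses = list(guesses(l, t))
count = len(all_guesses)
```

For these values there are 20 candidate assignments, well below the crude bound $t^l = 64$.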

Fixing $g_1, g_2, \ldots, g_t$, we run the Population Stage of the previous algorithm. However, whenever s reaches a new interval, we apply random sampling to obtain good empirical centers for the expensive clusters whose size lies three intervals "ahead". That is, in the beginning of interval i, the algorithm tries to collect centers for the clusters whose size is $\ge \frac{n}{k^{2(i+3)}} = \frac{n}{k^{2i+6}}$, yet $< \frac{n}{k^{2(i+3)-2}} = \frac{n}{k^{2i+4}}$. We assume for this algorithm that k is significantly

greater  than  Obviously,  if  k  is  a  constant,  then  we  can  use  the  existing  algorithm  of 
Kumar  et  al  [105]. 

In order to prove the correctness of the new algorithm, we need to show that the sampling step in the initialization stage succeeds with noticeable probability. Let $l_i$ be the actual number of expensive clusters whose size belongs to the range $\left[\frac{n}{k^{2i}}, \frac{n}{k^{2i-2}}\right)$. In the proof which follows, we assume that the correct guess for the $l_i$'s has been made, i.e. $g_i = l_i$ for every i. We say that the algorithm succeeds at the end of interval i if the following conditions hold:

1. Guess $l \le \frac{4^6}{\beta\epsilon}$, the number of expensive clusters. Set $t = \frac{1}{2}\log_k n$. Guess non-negative integers $g_1, g_2, \ldots, g_t$, such that $\sum_i g_i = l$.

2. Sample $g_1 + g_2 + g_3$ sets, by sampling independently and u.a.r. $O(\frac{1}{\beta} + \frac{1}{\epsilon})$ points for each set. For each such set $T_j$, add the singleton $\{\mu(T_j)\}$ to Q.

3. Modify the Population Stage from the previous algorithm, so that whenever $s = \frac{n}{k^{2i}}$ for some $i \ge 1$ (we call this the interval i):

• Sample $g_{i+3}$ sets, by sampling independently and u.a.r. $O(\frac{1}{\beta} + \frac{1}{\epsilon})$ points for each set. For each such set $T_j$, add the singleton $\{\mu(T_j)\}$ to Q.

4. Proceed with the Centers-Retrieving Stage as before.

Figure 7.3: A randomized PTAS for $\beta$-distributed instances of Euclidean k-means, that succeeds w.p. $k^{-O\left(\frac{\beta+\epsilon}{\beta^2\epsilon^2}\right)}$.


1. In the beginning of the interval, our guess for all clusters that belong to interval (i + 3) produces good empirical centers. That is, for every expensive cluster $C^*$ of size in the range $\left[\frac{n}{k^{2i+6}}, \frac{n}{k^{2i+4}}\right)$, the algorithm picks a sample T such that the mean $\mu(T)$ satisfies:

(a)  dVUV)  <  lipt,. 

(b) $\sum_{x \in C^*} d^2(x, \mu(T)) \le (1+\epsilon) \sum_{x \in C^*} d^2(x, c^*)$.

2. During the interval, we do not delete any point p that belongs to some target cluster $C^*$ of size $\le \frac{n}{k^{4+2(i+1)}}$.

3.  At  the  end  of  the  interval,  the  total  number  of  remaining  points  (points  that  were  not 
added  to  some  T  e  Q  or  deleted  from  the  instance  because  they  are  too  close  to 
some  T'  G  Q)  is  at  most 

Lemma 7.20. For every $i \ge 1$, let $S_i$ denote the event that the algorithm succeeds at the end of interval i. Then $\Pr[S_i \mid S_1, S_2, \ldots, S_{i-1}] \ge k^{-l_{i+3} \cdot O(\frac{1}{\beta} + \frac{1}{\epsilon})}$.

Before going into the proof we show that Lemma 7.20 implies that with noticeable probability, our algorithm returns a $(1+\epsilon)$-approximation of the k-means optimal clustering. First, observe the technical fact that for the first three intervals $l_1, l_2, l_3$, we need to guess the centers of clusters of size $\ge \frac{n}{k^6}$ before we start our Population Stage. However, as these clusters contain a $k^{-6}$ fraction of the points, then using Fact 7.13, our sampling finds good empirical centers for all of these $l_1 + l_2 + l_3$ expensive clusters w.p. $\ge k^{-(l_1+l_2+l_3) \cdot O(\frac{1}{\beta} + \frac{1}{\epsilon})}$. Applying Lemma 7.20 we get that the probability our algorithm succeeds after all intervals is $\ge 1/k^{O\left(\frac{\beta+\epsilon}{\beta^2\epsilon^2}\right)}$. Now, a similar analysis as in the previous section gives us that for the correct guess of the good components in Q, we find a clustering of cost at most $(1+\epsilon)\mathrm{OPT}$.


Proof  of  Lemma  7.20.  Recall  that  /3  is  a  constant,  whereas  k  is  not.  Specifically,  we  as¬ 
sume  throughout  the  proof  that  k2  >  pp,  and  so  we  allow  ourselves  to  use  asymptotic 
notation. 

We first prove that condition 2 holds during interval i. Assume for the sake of contradiction that for some cluster $C^*$ whose size is less than $\frac{n}{k^{2i+6}}$ there exists some point $y \in C^*$, which was added to some component T during interval i, at some stage $s \in \left[\frac{n}{k^{2i+2}}, \frac{n}{k^{2i}}\right)$. This means that by setting the radius $r = \frac{\beta\mathrm{OPT}}{64s}$, the ball $B(y, r)$ contains $> s/2 \ge \frac{n}{2k^{2i+2}}$ points. Since $C^*$ contains at most $\frac{n}{k^{2i+6}}$ many points, we have $|C^*| \ll s/2$, so at least s/4 points in $B(y, r)$ belong to other clusters. Our goal is to show that these s/4 points contribute more than OPT to the target clustering, thereby achieving a contradiction.

Let x be such a point, and denote the cluster that x is assigned to in the target clustering by $C_j^* \neq C^*$, with center $c_j^*$. Since the instance is $\beta$-distributed we have that $d^2(c^*, x) \ge \frac{\beta\mathrm{OPT}}{|C^*|} > \beta\mathrm{OPT}\,\frac{k^{2i+6}}{n}$. On the other hand, $d^2(x, y) \le r = \frac{\beta\mathrm{OPT}}{64s} \le \beta\mathrm{OPT}\,\frac{k^{2i+2}}{64n}$. Therefore, $d^2(c^*, x) = \Omega(k^4) \cdot r$, so $d^2(y, c^*) \ge (d(c^*, x) - d(x, y))^2 = \Omega(k^4) \cdot r$. Recall that in the target clustering each point is assigned to its nearest center, so $d^2(c_j^*, y) \ge d^2(c^*, y) = \Omega(k^4) \cdot r$. So we have that $d^2(c_j^*, x) \ge (d(c_j^*, y) - d(x, y))^2 = \Omega(k^4) \cdot r = \Omega(k^4) \cdot \frac{\beta\mathrm{OPT}}{64} \cdot \frac{k^{2i}}{n}$.

So, at least $s/4 = \Omega\left(\frac{n}{k^{2i+2}}\right)$ points contribute $\Omega(k^4)\,\frac{\beta\mathrm{OPT}}{64} \cdot \frac{k^{2i}}{n}$ each to the cost of the optimal clustering. Their total contribution is therefore $\Omega(k^2) \cdot \beta\mathrm{OPT} > \mathrm{OPT}$. Contradiction.

A similar proof gives that no point $y \in C^*$ is deleted from the instance because for some $x \in T$, where T is some component in Q, we have that $d^2(y, x) \le 4r$. Again, assume for the sake of contradiction that such y, x and T exist. Denote by $s \in \left[\frac{n}{k^{2i+2}}, \frac{n}{k^{2i}}\right)$ the stage in which we remove y, and denote by $s' \ge s$ the stage in which we insert T into Q. By setting the radius $r' = \frac{\beta\mathrm{OPT}}{64s'} \le r$, we have that the ball $B(x, r')$ contains at least $s'/2 \ge s/2$ points, and therefore, the ball $B(y, 5r)$ contains at least s/2 points. We now continue as in the previous case.

We now prove condition 1. We assume the algorithm succeeded in all previous intervals. Therefore, at the beginning of interval i, all points that belong to clusters of size $\le \frac{n}{k^{2i+4}}$ remain in the instance, and in particular, the clusters we wish to sample from at interval i remain intact. Furthermore, by the assumption that the algorithm succeeded up to interval (i - 1), we have that each expensive cluster that should be sampled at the beginning of interval i contains a $1/k^7$ fraction of the remaining points. We deduce that the probability that we pick a random sample of $O(\frac{1}{\beta} + \frac{1}{\epsilon})$ points from such an expensive cluster is at least $k^{-O(\frac{1}{\beta} + \frac{1}{\epsilon})}$. Using Fact 7.13 we have that with probability $\ge k^{-O(\frac{1}{\beta} + \frac{1}{\epsilon})}$ this sample yields a good empirical center.

We  now  prove  condition  3,  under  the  assumption  that  1  is  satisfied.  We  need  to  bound 
the  number  of  points  left  in  the  instance  at  the  end  of  interval  i.  There  are  two  types  of 
remaining  points:  points  that  in  the  target  clustering  belong  to  clusters  of  size  >  ,  and 

points  that  belong  to  clusters  of  size  <  To  bound  the  number  of  points  of  the  second 
type  is  simple  -  we  have  k  clusters,  so  the  overall  number  of  points  of  the  second  type  is 
at  most  2fc^_1 .  We  now  bound  the  number  of  remaining  points  of  the  first  type. 

At the end of the interval $s = \frac{n}{k^{2i+2}}$, so we remove from the instance any point p whose distance (squared) from some point in Q is at most $4r = \frac{\beta\mathrm{OPT}\, k^{2i+2}}{16n}$. We already know that

by  the  end  of  interval  i,  either  by  successfully  sampling  an  empirical  center  or  by  adding 
an  inner-ring  point  to  a  component  in  Q,  for  every  cluster  C*  of  size  >  exists  some 
T  E  Q  with  a  point  d  G  T,  s.t.  d2(c*,c')  <  Thus,  if  x  G  C*  is 

a  point  that  wasn’t  removed  from  the  instance  by  the  end  of  interval  i,  it  must  hold  that 
d2(c*,x )  >  ( d(c',x )  —  d(c*,c'))2  =  n(k2l+2)^^.  Clearly,  at  most  n  ■  0(/c"2l~2)  points 
can  contribute  that  much  to  the  cost  of  the  optimal  k- means  clustering,  and  so  the  number 
of  points  of  the  first  type  is  at  most  9AJ-_i  as  well.  □ 

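As we need to traverse all guesses $g_i$'s, the runtime of this algorithm is $O\left(n^3 (\log_k n)^{O(1/(\beta\epsilon))}\right)$. Repeating this algorithm $k^{O\left(\frac{\beta+\epsilon}{\beta^2\epsilon^2}\right)}$ many times, we increase the probability of success to be $\ge 1/2$, and incur runtime of $O\left(n^3 (\log_k n)^{O(1/(\beta\epsilon))}\, k^{O\left(\frac{\beta+\epsilon}{\beta^2\epsilon^2}\right)}\right)$.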
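The repetition count behind this amplification step follows from a standard calculation: if a single run succeeds with probability p, then m independent runs all fail with probability $(1-p)^m$, so roughly $\ln 2 / p$ runs suffice for success probability at least 1/2. The sketch below computes the smallest such m; `repetitions_for_half` is a name introduced here, and the value `p = 0.01` is arbitrary.

```python
import math

def repetitions_for_half(p):
    """Smallest m with 1 - (1 - p)**m >= 1/2, for a per-run success
    probability 0 < p <= 1 (runs are assumed independent)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if p >= 0.5:
        return 1
    return math.ceil(math.log(0.5) / math.log(1.0 - p))

runs = repetitions_for_half(0.01)
```

For `p = 0.01` this gives 69 runs, consistent with the $O(1/p)$ repetitions used in the text.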

7.5  Discussion  and  Open  Problems 

The algorithm we present here for k-median has runtime of $\mathrm{poly}(n^{1/\beta}, n^{1/\epsilon}, k)$, and the algorithm for k-means has runtime $\mathrm{poly}(n, (k\log n)^{1/\epsilon}, (k \log n)^{1/\beta})$.³ We comment that it is unlikely that we can obtain an algorithm of runtime $\mathrm{poly}(n^{1/\epsilon}, 1/\beta, k)$. Observe that for any clustering instance and any $k > 1$ we have that $\frac{\mathrm{OPT}_{(k-1)}}{\mathrm{OPT}_{(k)}} \ge 1 + \frac{1}{n}$, simply by considering the k-clustering that results from taking the optimal (k - 1)-clustering, and setting the point which is the furthest from its center in a cluster of its own (as a new center). Hence, any k-median/k-means instance is $\beta$-distributed for $\beta = \Omega(\frac{1}{n})$. Recall from Section 7.2.4 that the k-median problem restricted only to weakly-stable instances has no FPTAS. So the fact that our algorithm's runtime has super-polynomial dependence in both $1/\beta$ and $1/\epsilon$ is unavoidable. Nonetheless, one might still hope to do better. In particular, one major runtime expense of our algorithm comes from handling expensive clusters by brute-force guessing or sampling. Can one improve the runtime by doing something more clever for expensive clusters? It is worth noting that for the stability conditions of [25], Voevodski et al [140] develop an especially efficient implementation with good performance (in terms of both accuracy and speed) on real-world protein sequence datasets.

³ When dealing with k-means in a Euclidean space of dimension dim, we need to explicitly compute the distances, so we add $n^2 \dim$ to the runtime.
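The bound on the ratio between the optimal (k - 1)-clustering cost and the optimal k-clustering cost used in this discussion can be spelled out as follows. This is a sketch assuming the cost is a sum of non-negative per-point contributions (as in k-median and k-means), that $n \ge 2$ and that $\mathrm{OPT}_{(k)} > 0$; $\mathrm{cost}(x)$ is notation introduced here for the contribution of point $x$ in the optimal (k - 1)-clustering.

```latex
\begin{align*}
\max_x \mathrm{cost}(x) &\ge \frac{1}{n}\sum_x \mathrm{cost}(x) = \frac{\mathrm{OPT}_{(k-1)}}{n}, \\
\mathrm{OPT}_{(k)} &\le \mathrm{OPT}_{(k-1)} - \max_x \mathrm{cost}(x)
  \le \Bigl(1 - \frac{1}{n}\Bigr)\,\mathrm{OPT}_{(k-1)}, \\
\frac{\mathrm{OPT}_{(k-1)}}{\mathrm{OPT}_{(k)}} &\ge \frac{1}{1 - 1/n} \ge 1 + \frac{1}{n}.
\end{align*}
```

The second line uses the fact that turning the most expensive point into its own center drops its contribution to zero without increasing the contribution of any other point.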

A different open problem lies in the relation to results of Ostrovsky et al [121]. Their motivating question was to analyze the performance of Lloyd-type methods over stable instances. Is it possible that weak deletion-stability is sufficient for some version of the k-means heuristic to converge to the optimal clustering?

Chapter 8

Center-based Clustering under Perturbation Stability
8.1  Introduction 

In  the  vast  field  of  clustering,  the  recent  work  of  Bilu  and  Linial  [40]  takes  a  refreshing 
approach  to  clustering.  Bilu  and  Linial  [40],  focusing  on  the  Max-Cut  problem  [71],  pro¬ 
posed  considering  instances  where  the  optimal  clustering  is  optimal  not  only  under  the 
given  metric,  but  also  under  any  bounded  multiplicative  perturbation  of  the  given  metric. 
Bilu  and  Linial  [40]  analyze  Max-Cut  instances  of  this  type  and  show  that  for  instances 
that are stable to perturbations of multiplicative factor roughly $O(n^{1/2})$, one can retrieve
the  optimal  Max-Cut  in  polynomial  time.  They  conjecture  that  stability  up  to  only  con¬ 
stant  magnitude  perturbations  should  be  enough  to  solve  the  problem  in  polynomial  time. 
(Center-based  clustering  and  the  recent  approach  proposed  by  Bilu  and  Linial  [40]  were 
discussed  in  detail  in  Chapter  4.2.3.) 

In this chapter we show that this conjecture is indeed true for k-median and k-means objectives and in fact for any well-behaved center-based objective function (see Definition 4.2). We comment that in the previous chapter, Chapter 7, we studied instances satisfying "weak deletion stability" - where merging any two clusters in the optimal k-solution increases the cost by a noticeable factor. This notion is motivated both as a relaxation of the separation condition of Ostrovsky et al [121], and also as a relaxation of the notion considered by Balcan et al [25]. However, there exist "clear-cut" instances, where the optimal k-clustering is obvious, which weak-deletion stability fails to capture. The notion of stability studied in this chapter, perturbation resilience, indeed captures such instances (but fails to capture instances satisfying the weak-deletion stability notion of Chapter 7). An example is given in Figure 8.1.


(a) Instance satisfying perturbation resilience but isn't weak-deletion stable. The distance of any point to its cluster center is 1, the distance between two different centers is 5, and the remaining distances are shortest-path distances. With k clusters, each with $\frac{n}{k}$ points, merging two clusters increases the cost by a sub-constant factor of $O(1/k)$.

(b) Instance satisfying weak-deletion stability but isn't perturbation resilient. The cluster centers are within distance k apart, but there are many "middle" points whose distance to both centers is roughly k/2. Perturbing the distances can cause the middle points to change clusters.

Figure 8.1: Instances satisfying one notion of stability but not the other.


8.1.1  Main  Result 

For  clarity,  let  us  formally  redefine  the  notions  of  stability  under  multiplicative  perturba¬ 
tions. 

Definition 8.1. Given a metric $(S, d)$, and $\alpha > 1$, we say a function $d' : S \times S \to \mathbb{R}_{\ge 0}$ is an $\alpha$-perturbation of d, if for any $x, y \in S$ it holds that

$d(x,y) \le d'(x,y) \le \alpha\, d(x,y)$

Note that in this definition, much like in the definition of [40], d is a metric (satisfying reflexivity, symmetry and triangle inequality), yet $d'$ may be any non-negative function. In particular, we allow $d'$ to not satisfy the triangle inequality. We now give our main definitions and main theorem.
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Definition  8.2.  Suppose  we  have  a  clustering  instance  composed  of  n  points  residing  in  a 
metric  (S.  d)  and  an  objective  function  $  we  wish  to  optimize.  We  call  the  clustering  in¬ 
stance  a-perturbation  resilient  for  $  if  for  any  d!  which  is  an  a-perturbation  ofd,  the  ( only) 
optimal  clustering  of  (S,  d')  under  $  is  identical,  as  a  partition  of  points  into  subsets,  to 
the  optimal  clustering  of  (S,  d)  under  <f>. 
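
To make Definition 8.1 concrete, the following minimal sketch checks whether a candidate function is an $\alpha$-perturbation of a metric, and shows that a valid perturbation may break the triangle inequality. The helper names (`is_alpha_perturbation`, `satisfies_triangle_inequality`) and the three-point instance are illustrative assumptions, not part of the thesis.

```python
def is_alpha_perturbation(d, d_prime, points, alpha):
    """True if d(x,y) <= d'(x,y) <= alpha * d(x,y) for every pair of points."""
    return all(
        d[x][y] <= d_prime[x][y] <= alpha * d[x][y]
        for x in points
        for y in points
    )


def satisfies_triangle_inequality(dist, points):
    """True if dist(x,z) <= dist(x,y) + dist(y,z) for every triple of points."""
    return all(
        dist[x][z] <= dist[x][y] + dist[y][z]
        for x in points
        for y in points
        for z in points
    )


# Hypothetical toy metric: three points on a line at positions 0, 1 and 2.
points = ["x", "y", "z"]
position = {"x": 0, "y": 1, "z": 2}
d = {a: {b: abs(position[a] - position[b]) for b in points} for a in points}

# Stretch only the pair (x, z) by a factor of 3, keeping d' symmetric.
d_prime = {a: dict(row) for a, row in d.items()}
d_prime["x"]["z"] = d_prime["z"]["x"] = 3 * d["x"]["z"]

is_3_perturbation = is_alpha_perturbation(d, d_prime, points, alpha=3)
is_2_perturbation = is_alpha_perturbation(d, d_prime, points, alpha=2)
```

Here `d_prime` is a 3-perturbation but not a 2-perturbation, and it is no longer a metric: the stretched distance 6 between x and z exceeds the length-2 path through y. Definition 8.2 quantifies over all such functions, which is why perturbation resilience cannot be verified by enumeration.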

We will in particular be concerned with separable, center-based clustering objectives
$\Phi$ (which include k-median, k-means, and k-center among others).

Definition 8.3. A clustering objective is center-based if the optimal solution can be defined
by $k$ points $c_1^*, \ldots, c_k^*$ in the metric space, called centers, such that every data point is
assigned to its nearest center. Such a clustering objective is separable if it furthermore
satisfies the following two conditions:

•  The  objective  function  value  of  a  given  clustering  is  either  a  (weighted)  sum  or  the 
maximum  of  the  individual  cluster  scores. 

•  Given  a  proposed  single  cluster,  its  score  can  be  computed  in  polynomial  time. 

Our main result is that we can efficiently find the optimal clustering for perturbation-resilient
instances of separable center-based clustering objectives. In particular, we get an
efficient algorithm for 3-perturbation-resilient instances when the metric $S$ is defined only
over data points, and for $(2 + \sqrt{3})$-perturbation-resilient instances for general metrics.

Theorem 8.4. For $\alpha \ge 3$ (in the case of finite metrics defined only over the data) or
$\alpha \ge 2 + \sqrt{3}$ (for general metrics), there is a polynomial-time algorithm that finds the
optimal clustering of $\alpha$-perturbation resilient instances for any given separable
center-based clustering objective.

The algorithm, described in Section 8.2.2, turns out to be quite simple. As a first step, it
runs the classic single-linkage algorithm, but unlike the standard approach of halting when
k clusters remain, it runs the algorithm until all points have been merged into a single
cluster and keeps track of the entire tree-on-clusters produced.¹ Then, the algorithm's
second step is to apply dynamic programming to this hierarchical clustering to identify
the best k-clustering that is present within the tree. Using a result of Balcan et al. [27] we
show that the resulting clustering is indeed the optimal one. Although the details are very
different, our approach resembles, in spirit, the work of Bartal [32], Abraham et al. [2] and
Räcke [124] in the sense that we reduce the problem of retrieving an optimal solution from
a general instance to a tree-like instance (where it is poly-time solvable).

¹ The example depicted in Figure 8.4 shows that halting the Single-Linkage algorithm once k
clusters are formed may indeed fail on certain $\alpha$-perturbation resilient instances.

Our algorithms use only a weaker property, which we call center proximity (see
Section 8.2.1), that is implied by perturbation resilience. We then complement these results
with a lower bound showing that, for any $\epsilon > 0$, solving k-median on general metrics remains
NP-hard even for instances that satisfy $(3 - \epsilon)$-center proximity. We note that
while our belief was that allowing Steiner points in the lower bound was primarily a
technicality, Balcan and Liang [29] have recently shown this is not the case, giving a clever
algorithm that finds the optimal clustering for k-median instances in finite metrics when
$\alpha = 1 + \sqrt{2}$, and Reyzin [127] gave an NP-hardness result for clustering instances satisfying
$(2 - \epsilon)$-center proximity.


8.2  Proof  of  Main  Theorem 

8.2.1  Properties  of  Perturbation  Resilient  Instances 

We begin by deriving other properties which every 3-perturbation resilient clustering
instance must satisfy.

Definition 8.5. Let $p \in S$ be an arbitrary point, let $c_i^*$ be the center $p$ is assigned to in the
optimal clustering, and let $c_j^* \ne c_i^*$ be any other center in the optimal clustering. We say a
clustering instance satisfies the $\alpha$-center proximity property if for any $p$ it holds that

$$d(p, c_j^*) > \alpha\, d(p, c_i^*).$$

Fact 8.6. If a clustering instance satisfies the $\alpha$-perturbation resilience property, then it
also satisfies the $\alpha$-center proximity property.

Proof. Let $C_i^*$ and $C_j^*$ be any two clusters in the optimal clustering and pick any $p \in C_i^*$.
Assume we blow up all the pairwise distances within cluster $C_i^*$ by a factor of $\alpha$. As
this is a legitimate perturbation of the metric, it still holds that the optimal clustering
under this perturbation is the same as the original optimum. Hence, $p$ is still assigned
to the same cluster. Furthermore, since the distances within $C_i^*$ were all changed by the
same constant factor, $c_i^*$ will still remain an optimal center of cluster $i$. The same holds
for cluster $C_j^*$. It follows that even in this perturbed metric, $p$ prefers $c_i^*$ to $c_j^*$. Hence
$\alpha\, d(p, c_i^*) = d'(p, c_i^*) < d'(p, c_j^*) = d(p, c_j^*)$. □

Corollary 8.7. For every point $p$ and its center $c_i^*$, and for every point $p'$ from a different
cluster, it follows that $d(p,p') > (\alpha - 1)\, d(p, c_i^*)$.

Proof. Denote by $c_j^*$ the center of the cluster that $p'$ belongs to. Now, consider two
cases. Case (a): $d(p', c_j^*) \ge d(p, c_i^*)$. In this case, by the triangle inequality we get that
$d(p,p') \ge d(p', c_i^*) - d(p, c_i^*)$. Since the data instance is stable to $\alpha$-perturbations, Fact 8.6
gives us that $d(p', c_i^*) > \alpha\, d(p', c_j^*)$. Hence we get that
$d(p,p') > \alpha\, d(p', c_j^*) - d(p, c_i^*) \ge (\alpha - 1)\, d(p, c_i^*)$.
Case (b): $d(p', c_j^*) < d(p, c_i^*)$. Again by the triangle inequality we get that
$d(p,p') \ge d(p, c_j^*) - d(p', c_j^*) > \alpha\, d(p, c_i^*) - d(p', c_j^*) > (\alpha - 1)\, d(p, c_i^*)$. □
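
As an illustration of Definition 8.5 and Corollary 8.7, the sketch below tests both conditions on a small instance where the centers and the assignment of points to centers are given. The function names and the four-point instance on a line are assumptions made for the example; they do not appear in the thesis.

```python
def satisfies_center_proximity(dist, centers, assignment, alpha):
    """Check d(p, c_j) > alpha * d(p, c_i) for every point p with own center
    c_i = assignment[p] and every other center c_j."""
    for p, own in assignment.items():
        for other in centers:
            if other != own and not dist(p, other) > alpha * dist(p, own):
                return False
    return True


def satisfies_corollary(dist, assignment, alpha):
    """Check d(p, p') > (alpha - 1) * d(p, c_i) for all p, p' in different clusters."""
    return all(
        dist(p, q) > (alpha - 1) * dist(p, own)
        for p, own in assignment.items()
        for q, other in assignment.items()
        if other != own
    )


# Hypothetical instance on the real line; centers are data points.
dist = lambda a, b: abs(a - b)
centers = [0, 10]
assignment = {0: 0, 1: 0, 10: 10, 11: 10}  # point -> its center

proximity_at_3 = satisfies_center_proximity(dist, centers, assignment, alpha=3)
proximity_at_9 = satisfies_center_proximity(dist, centers, assignment, alpha=9)
corollary_at_3 = satisfies_corollary(dist, assignment, alpha=3)
```

On this instance the binding point is 1: its distance to the other center is 9 and to its own center is 1, so center proximity holds for every $\alpha < 9$ and fails at $\alpha = 9$. Note that the check takes the clustering as input; it does not find the optimal clustering.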

A key ingredient in the proof of Theorem 8.4 is the tree-clustering formulation of
Balcan et al. [27]. In particular, we prove that if an instance satisfies $\alpha$-center proximity
for $\alpha \ge 3$ then it also satisfies the “min-stability property” (defined below). This property,
as shown in [27], is a (necessary and) sufficient condition for the Single-Linkage algorithm
to produce a tree such that the optimal clustering is some pruning of this tree. In order to
define the “min-stability” property, we first introduce the following notation. For any two
subsets $A, B \subseteq S$, we denote the minimum distance between $A$ and $B$ as
$d_{\min}(A, B) = \min\{d(a, b) \mid a \in A,\ b \in B\}$.

Definition 8.8. A clustering instance satisfies the min-stability property if for any two
clusters $C$ and $C'$ in the optimal clustering, and any two subsets $A \subsetneq C$, $A' \subseteq C'$, it holds
that $d_{\min}(A, C \setminus A) < d_{\min}(A, A')$.

In words, the min-stability property means that for any set $A$ that is a strict subset of
some cluster $C$ in the optimal clustering, the closest point to $A$ is a point from $C \setminus A$, and
not from some other cluster. The next two lemmas lie at the heart of our algorithm.
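
The min-stability property can be checked literally on very small instances. The sketch below is a brute-force reading of the definition (exponential in the cluster sizes, so only meant for toy inputs); the names `d_min`, `nonempty_subsets` and `satisfies_min_stability` are hypothetical, and the comparison is taken to be strict, as the definition is stated above.

```python
from itertools import combinations


def d_min(dist, A, B):
    """Minimum distance between a point of A and a point of B."""
    return min(dist(a, b) for a in A for b in B)


def nonempty_subsets(cluster, proper):
    """Yield all non-empty subsets of cluster; skip the full set if proper is True."""
    items = list(cluster)
    largest = len(items) - 1 if proper else len(items)
    for size in range(1, largest + 1):
        for subset in combinations(items, size):
            yield set(subset)


def satisfies_min_stability(dist, clusters):
    """Check d_min(A, C \\ A) < d_min(A, A') for every strict subset A of a
    cluster C and every non-empty subset A' of a different cluster C'."""
    for i, C in enumerate(clusters):
        for A in nonempty_subsets(C, proper=True):
            inside = d_min(dist, A, set(C) - A)
            for j, C_other in enumerate(clusters):
                if j == i:
                    continue
                for A_other in nonempty_subsets(C_other, proper=False):
                    if not inside < d_min(dist, A, A_other):
                        return False
    return True


dist = lambda a, b: abs(a - b)
well_separated = [{0, 1, 2}, {10, 11}]
badly_separated = [{0, 4}, {5, 6}]  # the point 4 is closer to 5 than to 0
```

Since $d_{\min}(A, A') \ge d_{\min}(A, C')$ for every $A' \subseteq C'$, it would suffice to test $A' = C'$; the inner loop is kept only to mirror the definition.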

Lemma 8.9. A clustering instance (for a center-based clustering objective) in which centers
are data points, that satisfies $\alpha$-center proximity for $\alpha \ge 3$, also satisfies the
min-stability property.

Proof. Let $C_i^*, C_j^*$ be any two clusters in the target clustering. Let $A$ and $A'$ be any two
subsets s.t. $A \subsetneq C_i^*$ and $A' \subseteq C_j^*$. Let $p \in A$ and $p' \in A'$ be the two points which obtain
the minimum distance $d_{\min}(A, A')$. Let $q \in C_i^* \setminus A$ be the nearest point to $p$. Also, denote
by $c_i^*$ and $c_j^*$ the centers of clusters $C_i^*$ and $C_j^*$ respectively.

For the sake of contradiction, assume that $d_{\min}(A, C_i^* \setminus A) \ge d_{\min}(A, A')$. Suppose
$c_i^* \notin A$. This means that $d(p,p') = d_{\min}(A, A') \le d_{\min}(A, C_i^* \setminus A) \le d(p, c_i^*)$. As $\alpha \ge 3$,
this contradicts Corollary 8.7.

Thus we may assume $c_i^* \in A$. It follows that
$d(q, c_i^*) \ge d(p,p') > (3 - 1)\, d(p, c_i^*) = 2\, d(p, c_i^*)$, so $d(p, c_i^*) < d(q, c_i^*)/2$.
We therefore have that $d(p', c_i^*) \le d(p,p') + d(p, c_i^*) < 3\, d(q, c_i^*)/2$.
This implies that $d(p', c_j^*) < d(p', c_i^*)/\alpha < d(q, c_i^*)/2$, and thus
$d(q, c_j^*) \le d(q, c_i^*) + d(c_i^*, p) + d(p,p') + d(p', c_j^*) < 3\, d(q, c_i^*) \le \alpha\, d(q, c_i^*)$.
This contradicts Fact 8.6. □

Lemma 8.10. A clustering instance (for a center-based clustering objective) in which
centers need not be data points, that satisfies $\alpha$-center proximity for $\alpha \ge 2 + \sqrt{3}$, also
satisfies the min-stability property.

Proof. As in the proof of Lemma 8.9, let $C_i^*, C_j^*$ be any two clusters in the target clustering
and let $A$ and $A'$ be any two subsets s.t. $A \subsetneq C_i^*$ and $A' \subseteq C_j^*$. Let $p \in A$ and $p' \in A'$ be
the two points which obtain the minimum distance $d_{\min}(A, A')$ and let $q \in C_i^* \setminus A$ be the
nearest point to $p$. Also, as in the proof of Lemma 8.9, let $c_i^*$ and $c_j^*$ denote the centers of
clusters $C_i^*$ and $C_j^*$ respectively (though these need not be data points).

By definition of center proximity, we have the following inequalities:

$$d(p,p') + d(p', c_j^*) > \alpha\, d(p, c_i^*)$$   [c.p. applied to $p$]
$$d(p,p') + d(p, c_i^*) > \alpha\, d(p', c_j^*)$$   [c.p. applied to $p'$]
$$d(p,p') + d(p', c_j^*) + d(p,q) > \alpha\,\bigl(d(q,p) - d(p, c_i^*)\bigr)$$   [center proximity applied to $q$ and triangle ineq.]

Multiplying the first inequality by $1 - \frac{2\alpha}{\alpha^2 - 1}$, the second by $\frac{1}{\alpha + 1}$, the third by $\frac{1}{\alpha - 1}$,
and summing them together we get

$$d(p,p') > \frac{\alpha^2 - 4\alpha + 1}{\alpha - 1}\, d(p, c_i^*) + d(q,p),$$

which for $\alpha \ge 2 + \sqrt{3}$ (where $\alpha^2 - 4\alpha + 1 \ge 0$) implies $d(p,p') > d(q,p)$ as desired. □
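
The weighted combination used in the proof of Lemma 8.10 can be checked mechanically. The sketch below uses exact rational arithmetic and one choice of weights that makes the combination work out: $\lambda_2 = 1/(\alpha+1)$, $\lambda_3 = 1/(\alpha-1)$ and $\lambda_1 = 1 - \lambda_2 - \lambda_3$. These weights are recomputed here from the three inequalities rather than quoted, and the function name and the shorthand $a, b, e, f$ are assumptions of the example.

```python
from fractions import Fraction


def combination_coefficients(alpha):
    """Combine the three center-proximity inequalities with weights l1, l2, l3.

    With a = d(p,p'), b = d(p',c_j), e = d(p,c_i), f = d(q,p), they read
        (1) a + b     > alpha * e
        (2) a + e     > alpha * b
        (3) a + b + f > alpha * (f - e)
    The weighted sum, rearranged, is
        coef_a * a > coef_b * b + coef_e * e + coef_f * f.
    """
    alpha = Fraction(alpha)
    l2 = 1 / (alpha + 1)
    l3 = 1 / (alpha - 1)
    l1 = 1 - l2 - l3  # equals 1 - 2*alpha / (alpha**2 - 1)
    return {
        "weights": (l1, l2, l3),
        "coef_a": l1 + l2 + l3,
        "coef_b": alpha * l2 - l1 - l3,
        "coef_e": alpha * l1 - alpha * l3 - l2,
        "coef_f": (alpha - 1) * l3,
    }


def predicted_coef_e(alpha):
    """The closed form (alpha^2 - 4*alpha + 1) / (alpha - 1)."""
    alpha = Fraction(alpha)
    return (alpha * alpha - 4 * alpha + 1) / (alpha - 1)


result = combination_coefficients(4)
```

For $\alpha = 4$ the weights are $(7/15, 1/5, 1/3)$, the $b$ term cancels, and the combined inequality reads $a > \tfrac{1}{3} e + f$. The coefficient of $e$ changes sign exactly at the root $2 + \sqrt{3} \approx 3.732$ of $\alpha^2 - 4\alpha + 1$: it is positive at $\alpha = 15/4$ and negative at $\alpha = 37/10$.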


8.2.2  The  Algorithm 

As mentioned, Balcan et al. [27] proved (Theorem 2) that if an instance satisfies
min-stability, then the tree on clusters produced by the single-linkage algorithm contains the
optimal clustering as some k-pruning of it. That is, the tree is produced by starting with n
clusters of size 1 (viewed as leaves), and at each step merging the two clusters $C, C'$
minimizing $d_{\min}(C, C')$ (viewing the merged cluster as their parent) until only one cluster remains.
Given the structural results proven above, our algorithm (see Figure 8.2) simply uses this
clustering tree and finds the best k-pruning using dynamic programming.




1. Run Single-Linkage until only one cluster remains, producing the entire tree on
clusters.

2. Find the best k-pruning of the tree by dynamic programming using the equality

best-k-pruning(T) = min_{0 < k' < k} { best-k'-pruning(T's left child)
                                       + best-(k − k')-pruning(T's right child) }

Figure 8.2: Algorithm to find the optimal k-clustering of instances satisfying $\alpha$-center proximity.
The algorithm is described for the case (as in k-median or k-means) that $\Phi$ defines the overall score
to be a sum over individual cluster scores. If it is a maximum (as in k-center) then replace “+”
with “max” above.
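
The two steps of Figure 8.2 can be sketched as follows. This is a naive illustration under stated assumptions, not the thesis's implementation: the names `Node`, `single_linkage_tree`, `k_median_cost` and `best_pruning` are hypothetical, the single-linkage step is written in the simplest cubic-time way, and the k-median score assumes centers are data points of the cluster.

```python
import math


class Node:
    """A node of the single-linkage tree; leaves have no children."""

    def __init__(self, members, left=None, right=None):
        self.members = members  # tuple of the points below this node
        self.left = left
        self.right = right


def single_linkage_tree(points, dist):
    """Step 1: merge the two clusters at minimum d_min until one remains."""
    nodes = [Node((p,)) for p in points]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                gap = min(dist(a, b) for a in nodes[i].members for b in nodes[j].members)
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        merged = Node(nodes[i].members + nodes[j].members, nodes[i], nodes[j])
        nodes = [n for t, n in enumerate(nodes) if t not in (i, j)] + [merged]
    return nodes[0]


def k_median_cost(dist):
    """Score of a single cluster: best data-point center, sum of distances."""
    def cost(members):
        return min(sum(dist(c, p) for p in members) for c in members)
    return cost


def best_pruning(node, k, cluster_cost, combine=lambda x, y: x + y, memo=None):
    """Step 2: return (cost, clusters) of the best k-pruning below node.
    Pass combine=max for objectives such as k-center."""
    if memo is None:
        memo = {}
    key = (id(node), k)
    if key in memo:
        return memo[key]
    if k == 1:
        result = (cluster_cost(node.members), [node.members])
    elif node.left is None or k > len(node.members):
        result = (math.inf, [])  # a subtree cannot be cut into more parts than points
    else:
        result = (math.inf, [])
        for k1 in range(1, k):  # k1 clusters on the left, k - k1 on the right
            cost_l, cl_l = best_pruning(node.left, k1, cluster_cost, combine, memo)
            cost_r, cl_r = best_pruning(node.right, k - k1, cluster_cost, combine, memo)
            total = combine(cost_l, cost_r)
            if total < result[0]:
                result = (total, cl_l + cl_r)
    memo[key] = result
    return result


# Hypothetical toy instance on the real line.
dist = lambda a, b: abs(a - b)
root = single_linkage_tree([0, 1, 2, 10, 11, 30], dist)
cost, clusters = best_pruning(root, 3, k_median_cost(dist))
```

On this toy instance the best 3-pruning is {0, 1, 2}, {10, 11}, {30} with k-median cost 3. The sketch only searches clusterings that are prunings of the tree; the guarantee that the optimum is among them comes from min-stability, which the code does not verify.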


Proof of Theorem 8.4. By Lemmas 8.9 and 8.10, the data satisfies the min-stability
property, which as shown in [27] is sufficient to guarantee that some pruning of the
single-linkage hierarchy is the target clustering. We then find the optimal clustering using
dynamic programming by examining k-partitions laminar with the single-linkage clustering
tree. The optimal k-clustering of a tree-node is either the entire subtree as one cluster (if
$k = 1$), or the minimum over all choices of $k_1$-clusters over its left subtree and $k_2$-clusters
over its right subtree (if $k > 1$). Here $k_1, k_2$ are positive integers such that $k_1 + k_2 = k$.
Therefore, we just traverse the tree bottom-up, recursively solving the clustering problem
for each tree-node. By the assumption that the clustering objective is separable, each step,
including the base case, can be performed in polynomial time. For the case of k-median
in a finite metric, for example, one can maintain an $n \times O(n)$ table for all possible centers
and all possible clusters in the tree, yielding a running time of $O(n^2 + nk^2)$. For the case
of k-means in Euclidean space, one can compute the cost of a single cluster by computing
the center as just the average of all its points. In general, the overall running time is
$O(n(k^2 + T(n)))$, where $T(n)$ denotes the time it takes to compute the cost of a single
cluster. □


8.2.3  Some  Natural  Barriers 

We complete this section with a discussion of barriers of our approach. First, our algorithm
indeed fails on some finite metrics that are $(3 - \epsilon)$-perturbation resilient. For example,
consider the instance shown in Figure 8.3. In this instance, the clustering tree produced by
single-linkage is not laminar with the optimal k-median clustering. It is easy to check that
this instance is resilient to $\alpha$-perturbations for any $\alpha < 3$.

Figure 8.3: An example of a finite metric k-median instance with $2 < \alpha < 3$ where our algorithm
fails. The optimal 2-median clustering is $\{c, p, q\}$, $\{c', p'\}$. In contrast, when we run our algorithm
on this instance, single linkage first connects $\{c, p\}$ with $\{c', p'\}$, and only then merges these
4 points with $q$.


Second, observe that our analysis, though emanating from perturbation resilience, only
uses center proximity. We next show that for general metrics, one cannot hope to solve (in
poly-time) k-median instances satisfying $\alpha$-center proximity for $\alpha < 3$. This is close to
our upper bound of $2 + \sqrt{3}$ for general metrics.

Theorem 8.11. For any $\alpha < 3$, the problem of solving k-median instances over general
metrics that satisfy $\alpha$-center proximity is NP-hard.


Proof. The proof of Theorem 8.11 follows from the classical reduction of Max-k-Coverage
to k-median. In this reduction, we create a bipartite graph where the right-hand side
vertices represent the elements in the ground set; the left-hand side vertices represent the
given subsets; and the distance between a set-vertex and an element-vertex is 1 if the
set contains that element. Using shortest-path distances, it follows that the distance from
any element-vertex to a set-vertex to which it does not belong is at least 3. The
NP-hardness result for Max-k-Coverage holds for disjoint sets (i.e. the optimal
solution of Yes-instances is composed of k disjoint sets, see [68]). In such a solution every
element-vertex is at distance 1 from its own center and at distance at least 3 from any other
center, so the $\alpha$-center proximity property follows for any $\alpha < 3$. □
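
The distance structure used in this reduction can be illustrated on a tiny set system. The sketch below builds the bipartite set/element graph and computes shortest-path distances by breadth-first search; the set system, the vertex labels and the function names are assumptions of the example.

```python
from collections import deque


def build_graph(sets):
    """Bipartite graph: an edge of length 1 joins a set-vertex to each of its elements."""
    adj = {}
    for name, elements in sets.items():
        set_vertex = ("set", name)
        adj.setdefault(set_vertex, set())
        for e in elements:
            elem_vertex = ("elem", e)
            adj[set_vertex].add(elem_vertex)
            adj.setdefault(elem_vertex, set()).add(set_vertex)
    return adj


def shortest_paths(adj, source):
    """Breadth-first search distances from source (unit edge lengths)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


# Hypothetical instance: S1 and S2 form a disjoint cover of {1, 2, 3, 4}.
sets = {"S1": {1, 2}, "S2": {3, 4}, "S3": {2, 3}}
adj = build_graph(sets)
centers = [("set", "S1"), ("set", "S2")]

ratios = []
for e in (1, 2, 3, 4):
    dist = shortest_paths(adj, ("elem", e))
    own, other = sorted(dist[c] for c in centers)
    ratios.append(other / own)
```

Every element is at distance 1 from the set that covers it and at distance at least 3 from the other center, so the smallest ratio in `ratios` is 3. This is why the centers of a disjoint cover satisfy $\alpha$-center proximity for every $\alpha < 3$ but not for $\alpha = 3$.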


Lastly, we comment that using Single-Linkage in the usual way (namely, stopping
when there are k clusters remaining) is not sufficient to produce a good clustering. We
demonstrate this using the example shown in Figure 8.4. Observe that in this instance, since
C contains significantly fewer points than A, B, or D, this instance is stable: even if we
perturb distances by a factor of 3, the cost of any alternative clustering is higher than the
cost of the optimal solution. However, because $d(A,C) > d(B,D)$, it follows that the
usual version of Single-Linkage will unite B and D, and only then A and C. Hence, if we
stop the Single-Linkage algorithm at $k = 3$ clusters, we will not get the desired clustering.

Figure 8.4: An example showing that the usual version of Single-Linkage fails. The instance
is composed of 4 components, each with inner-distance $\epsilon$ and outer-distance as described in the
figure. However, components A, B and D each contain 100 points, whereas component C has
only 10 points. The optimal 3-median clustering consists of 3 clusters: $\{A, C\}$, $\{B\}$, $\{D\}$ and has
cost OPT $= 200 + 300\epsilon$.

8.3  Future  Directions 

There are several natural open questions left by this work. First, can one reduce the
perturbation factor $\alpha$ needed for efficient clustering? As mentioned earlier, recently Balcan and
Liang [29] have given a very interesting algorithm that reduces the $\alpha = 3$ factor needed
by our algorithm for finite metrics to $1 + \sqrt{2}$. Additionally, they also extend our work in
a different direction, giving algorithms for clustering instances that are “mostly resilient”
to $\alpha$-perturbations: having the property that under any $\alpha$-perturbation of the underlying
metric, no more than a $\delta$-fraction of the points get mislabeled under the optimal solution.
Reyzin [127] gives an NP-hardness result for clustering instances satisfying 2-center proximity
(with no Steiner points), as well as a single-pass algorithm that clusters instances which are
$\approx 5.7$-perturbation resilient.

Alternatively, one can consider a weaker notion of resilience to perturbations on average:
a clustering instance whose optimal clustering is likely not to change, assuming the
perturbation is random from some suitable distribution. Can this weaker notion be used to
still achieve positive guarantees?

Chapter 9

Improved  Spectral-Norm  Bounds  for 
Clustering 

9.1  Introduction 

In the long-studied field of clustering, there has been substantial work [56, 59, 13, 138, 4,
50, 96, 54, 47] studying the problem of clustering data from a mixture of distributions under
the assumption that the means of the distributions are sufficiently far apart. Each of these
works focuses on one particular type (or family) of distribution, and devises an algorithm
that successfully clusters datasets that come from that particular type. Typically, they
show that w.h.p. such datasets have certain nice properties, then use these properties in the
construction of the clustering algorithm.

The recent work of Kumar and Kannan [104] takes the opposite approach. First, they
define a separation condition, deterministic and thus not tied to any distribution, and show
that any set of data points satisfying this condition can be successfully clustered. Having
established that, they show that many previously studied clustering problems indeed satisfy
(w.h.p.) this separation condition. These clustering problems include Gaussian mixture
models, the Planted Partition model of McSherry [110] and the work of Ostrovsky et
al. [121]. In this respect they aim to unify the existing body of work on clustering under
separation assumptions, proving that one algorithm applies in multiple scenarios.¹

¹ We comment that, implicitly, Achlioptas and McSherry [4] follow a similar approach, yet they focus
only on mixtures of Gaussians and log-concave distributions. In addition, the work of [53] studies a
deterministic separation condition required for efficient clustering, extending the separation condition of [110].
The precise condition presented in [53] is technical but essentially assumes that the underlying graph over
the set of points has a “low rank structure”, and presents an algorithm to recover this structure, which is then
enough to cluster well.

However, the attempt to unify multiple clustering works is only successful in part.
First, Kumar and Kannan's analysis is “wasteful” w.r.t. the number of clusters $k$. Clearly
motivated by an underlying assumption that $k$ is constant, their separation bound has linear
dependence on $k$ and their classification guarantee has quadratic dependence on $k$. As a
result, Kumar and Kannan overshoot the best known bounds for the Planted Partition model
and for mixtures of Gaussians by a factor of $\sqrt{k}$. Similarly, the application to datasets
considered by Ostrovsky et al. only holds for constant $k$. Secondly, the analysis of Kumar
and Kannan is far from simple: it relies on most points being “good”, and requires multiple
iterations of Lloyd steps before converging to good centers. Our work addresses these
issues.

To formally define the separation condition of [104], we require some notation. Our
input consists of $n$ points in $\mathbb{R}^d$. We view our dataset as an $n \times d$ matrix, $A$, where each
datapoint corresponds to a row $A_i$ in this matrix. We assume the existence of a target
partition, $C^* = \{C_1^*, C_2^*, \ldots, C_k^*\}$, where each cluster's center is
$\mu(C_r^*) = \frac{1}{n_r} \sum_{i \in C_r^*} A_i$, where $n_r = |C_r^*|$. Thus, the target clustering is represented
by an $n \times d$ matrix of cluster centers, $C$, where the $i$-th row of $C$ equals $\mu(C_r^*)$ iff $i \in C_r^*$.
Therefore, the k-means cost of this partition is the squared Frobenius norm $\|A - C\|_F^2$, but the
focus of this paper is on the spectral ($L_2$) norm of the matrix $A - C$. Indeed, the deterministic
equivalent of the maximal variance in any direction is, by definition,
$\frac{1}{n}\|A - C\|^2 = \max_{\{v :\, \|v\| = 1\}} \frac{1}{n}\|(A - C)v\|^2$.

Definition. Fix $i \in C_r^*$. We say a datapoint $A_i$ satisfies the Kumar-Kannan proximity
condition if for any $s \ne r$, when projecting $A_i$ onto the line connecting $\mu_r$ and $\mu_s$, the
projection of $A_i$ is closer to $\mu(C_r^*)$ than to $\mu(C_s^*)$ by an additive factor of
$k\left(\frac{1}{\sqrt{n_r}} + \frac{1}{\sqrt{n_s}}\right)\|A - C\|$.

Kumar and Kannan proved that if all but at most an $\epsilon$-fraction of the data points satisfy
the proximity condition, they can find a clustering which is correct on all but an
$O(k^2\epsilon)$-fraction of the points. In particular, when $\epsilon = 0$, their algorithm clusters all points
correctly. Observe that the Kumar-Kannan proximity condition implies that the distance
$\|\mu_r - \mu_s\|$ is also bigger than the above-mentioned bound. The opposite also holds: one can show
that if $\|\mu_r - \mu_s\|$ is greater than this bound then only few of the points do not satisfy the
proximity condition.

9.1.1 Our Contribution


Our Separation Condition. In this work, the bulk of our analysis is based on the
following quantitatively weaker version of the proximity condition, which we call center
separation. Formally, we define $\Delta_r = \frac{1}{\sqrt{n_r}} \min\{\sqrt{k}\,\|A - C\|,\ \|A - C\|_F\}$ and we assume
throughout the paper that for a large constant² $c$ we have that the means of any two clusters
$C_r^*$ and $C_s^*$ satisfy

$$\|\mu(C_r^*) - \mu(C_s^*)\| \ge c\,(\Delta_r + \Delta_s) \qquad (9.1)$$

Observe that this is a simpler version of the Kumar-Kannan proximity condition, scaled
down by a factor of $\sqrt{k}$. Even though we show that (9.1) implies that only a few points
do not satisfy the proximity condition, our analysis (for the most part) does not partition
the dataset into good and bad points based on whether they satisfy the proximity
condition. Instead, our analysis relies on basic tools, such as the Markov inequality and the
triangle inequality. In that sense one can view our work as “aligning” Kumar and Kannan's
work with the rest of the clustering-under-center-separation literature: we show that the bulk
of Kumar and Kannan's analysis can be simplified to rely merely on center separation.
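
As a small numeric illustration of the center-separation condition (9.1), the sketch below reads the definition as $\Delta_r = \min\{\sqrt{k}\,\|A - C\|, \|A - C\|_F\}/\sqrt{n_r}$ and tests every pair of cluster means. The two norms are taken as given inputs; the function names and the toy numbers are assumptions of the example, and the small values of $c$ are used only to exercise the code (the text thinks of $c$ as a large constant).

```python
import math


def delta(n_r, k, spectral, frobenius):
    """Delta_r = min(sqrt(k) * ||A - C||, ||A - C||_F) / sqrt(n_r)."""
    return min(math.sqrt(k) * spectral, frobenius) / math.sqrt(n_r)


def satisfies_center_separation(means, sizes, spectral, frobenius, c):
    """Check ||mu_r - mu_s|| >= c * (Delta_r + Delta_s) for every pair r != s."""
    k = len(means)
    deltas = [delta(n, k, spectral, frobenius) for n in sizes]
    for r in range(k):
        for s in range(r + 1, k):
            if math.dist(means[r], means[s]) < c * (deltas[r] + deltas[s]):
                return False
    return True


# Hypothetical toy data: two clusters of two points each, with cluster means
# (2, 0) and (10, 0), ||A - C|| = sqrt(8) and ||A - C||_F = sqrt(10).
means = [(2.0, 0.0), (10.0, 0.0)]
sizes = [2, 2]
spectral, frobenius = math.sqrt(8), math.sqrt(10)

separated_c1 = satisfies_center_separation(means, sizes, spectral, frobenius, c=1)
separated_c2 = satisfies_center_separation(means, sizes, spectral, frobenius, c=2)
```

Here $\sqrt{k}\,\|A - C\| = 4$ exceeds $\|A - C\|_F \approx 3.16$, so the Frobenius term determines $\Delta_r = \sqrt{5}$ for both clusters. The means are at distance 8, which passes the test for $c = 1$ (threshold about 4.47) and fails it for $c = 2$ (threshold about 8.94).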


Our results. We improve upon the results of [104] along several axes. In addition to
the weaker condition of Equation (9.1), we also weaken the Kumar-Kannan proximity
condition by a factor of $k$, and still retrieve the target clustering, if all points satisfy the
($k$-weaker) proximity condition. Secondly, if at most $\epsilon n$ points do not satisfy the $k$-weaker
proximity condition, we show that we can correctly classify all but an $(\epsilon + O(1/c^4))$-fraction
of the points, improving over the bound of [104] of $O(k^2\epsilon)$. Note that our bound is
meaningful even if $\epsilon$ is a constant whereas $k = \omega(1)$. Furthermore, we prove that the k-means
cost of the clustering we output is a $(1 + O(1/c))$-approximation of the k-means cost of
the target clustering.

Once we have improved on the main theorem of Kumar and Kannan, we derive
immediate improvements on its applications. In Section 9.3.1 we show our analysis subsumes
the work of Ostrovsky et al. [121], and applies also to non-constant $k$. Using the fact that
Equation (9.1) “shaves off” a $\sqrt{k}$ factor from the separation condition of Kumar and
Kannan, we obtain a separation condition of $\Omega(\sigma_{\max}\sqrt{k})$ for learning a mixture of Gaussians,
and we also match the separation results of the Planted Partition model of McSherry [110].
These results are described in Section 9.5.

² We comment that throughout the paper, and much like Kumar and Kannan, we think of $c$ as a large
constant ($c = 100$ will do). However, our results also hold when $c = \omega(1)$, allowing for a
$(1 + o(1))$-approximation. We also comment that we think of $d \gg k$, so one should expect
$\|A - C\|_F^2 \ge k\|A - C\|^2$ to hold, thus the reader should think of $\Delta_r$ as dependent on
$\sqrt{k}\,\|A - C\|$. Still, including the degenerate case, where $\|A - C\|_F^2 < k\|A - C\|^2$, simplifies
our analysis in Section 9.3. One final comment is that (much like all the work in this field) we
assume $k$ is given, as part of the input, and not unknown.


Comparison with previous stability notions. In Chapter 7 we studied clustering
instances satisfying weak-deletion stability, where merging two clusters in the optimal
k-means clustering increases the cost significantly. In Chapter 8 we studied the notion
of instances which are perturbation resilient, where any 3-multiplicative change to the
distances doesn't change the optimal k-means clustering. Both of these notions are very
different from the center-separation notion studied here. Think of the motivating example
of learning a mixture of $k$ Gaussians in a high-dimensional space $\mathbb{R}^d$ (with $d \gg k$), and
with $\Delta_r = \frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$. In this example, merging clusters $r$ and $s$ by assigning the $n_r$
points in cluster $r$ to $\mu_s$ increases the cost by a term of order $n_r \Delta_r^2 = k\|A - C\|^2$. Assuming
$k\|A - C\|^2 \ll \|A - C\|_F^2$, we have that the increase in cost is negligible and so the instance
doesn't satisfy weak-deletion stability. Similarly, such a mixture of Gaussians doesn't
satisfy perturbation resilience, as the distance between cluster centers (under the right choice
of parameters) is even smaller than the average distance of a point to its own cluster center.
As we show, it is only after we project the instance onto the subspace spanned by its top
$k$ singular vectors that we get an instance where distances between centers are large in
comparison to distances inside a cluster.

We comment that indeed, in the case where $\Delta_r = \frac{1}{\sqrt{n_r}}\|A - C\|_F$, then (as we discuss
in Section 9.3.1) we do have an instance which is of the type considered by Ostrovsky et
al. [121], and therefore the instance is also weak-deletion stable. Still, the algorithm we
propose here is deterministic (as opposed to the randomized algorithm of [121]) and its
running time is much smaller than the running time of the algorithm detailed in Chapter 7
(no exponential dependency on a parameter $\beta$). Finally, we comment that one can design
instances in small dimension (even $d = 2$) satisfying perturbation resilience and yet they
do not satisfy center separation. An example of such an instance was given in Figure 8.1.


Organization. To formally detail our results, we first define some notation and discuss
a few preliminary facts. The next section (Section 9.2) contains the details of the
preliminaries, as well as a formal statement of our algorithm and its guarantees, and an overview
of the proof. The first part of the analysis of our algorithm is in Section 9.3, and it is
enough for us to give a “one-line” proof in Section 9.3.1 showing how the work of
Ostrovsky et al. falls into our framework. The second part of the analysis of our algorithm is in
Section 9.4. The improved guarantees we get by applying the algorithm to the Planted
Partition model and to the Gaussian mixture model are discussed in Section 9.5. We conclude
with an open problem in Section 9.6.

9.2  Notations  and  Preliminaries 

9.2.1  Notation 

The Frobenius norm of an $n \times m$ matrix $M$, denoted $\|M\|_F$, is defined as
$\|M\|_F^2 = \sum_{i,j} M_{i,j}^2$. The spectral norm of $M$ is defined as
$\|M\| = \max_{x : \|x\| = 1} \|Mx\|$. It is a well known fact that if the rank of $M$ is $t$,
then $\|M\|_F^2 \le t\|M\|^2$. Recall the Singular Value Decomposition (SVD) of $M$, denoted
$M = U \Sigma V^T$, where $U$ is an $n \times n$ unitary matrix, $V$ is an $m \times m$ unitary matrix,
$\Sigma$ is an $n \times m$ diagonal matrix whose entries are nonnegative real numbers, and its
diagonal entries satisfy $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_{\min\{n,m\}}$. The columns
of $U$ and $V$, denoted $u_i$ and $v_i$ resp., are called the left- and right-singular vectors. As a
convention, when referring to singular vectors, we mean the right-singular vectors.
Projecting $M$ onto its top $t$ singular vectors means taking $\hat M = \sum_{i=1}^{t} \sigma_i u_i v_i^T$.
It is a known fact that for any $t$, the $t$-dimensional subspace which best fits the rows of $M$ is
obtained by projecting $M$ onto the subspace spanned by the top $t$ singular vectors (corresponding
to the top $t$ singular values). Another way to phrase this result is by saying that
$\hat M = \arg\min_{N : \mathrm{rank}(N) = t} \{\|M - N\|_F\}$. (For a proof, see [97].) The same matrix,
$\hat M$, also minimizes the spectral norm of this difference, meaning
$\hat M = \arg\min_{N : \mathrm{rank}(N) = t} \{\|M - N\|\}$ (see [73] for a proof).
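The norm inequality and the best-fit property above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `project_top_singular`, the random test matrix and the seed are illustrative choices and do not come from the text.

```python
import numpy as np

def project_top_singular(M, t):
    """Keep only the top t singular triplets of M (a rank-t matrix)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :t] * s[:t]) @ Vt[:t, :]

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
M = rng.standard_normal((30, 12))   # hypothetical 30 x 12 data matrix
t = 3
M_hat = project_top_singular(M, t)
sigma = np.linalg.svd(M, compute_uv=False)

# A rank-t matrix satisfies ||M||_F^2 <= t * ||M||^2.
fro_sq = np.linalg.norm(M_hat, "fro") ** 2
spec_sq = np.linalg.norm(M_hat, 2) ** 2

# The spectral error of the projection equals the (t+1)-th singular value.
spec_err = np.linalg.norm(M - M_hat, 2)

# Any other rank-t matrix N does no better in Frobenius norm.
N = rng.standard_normal((30, t)) @ rng.standard_normal((t, 12))
fro_err_hat = np.linalg.norm(M - M_hat, "fro")
fro_err_N = np.linalg.norm(M - N, "fro")
```

Comparing against a single random rank-$t$ matrix is of course only a sanity check, not a proof of optimality.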

As previously defined, $\|A - C\|$ denotes the spectral norm of $A - C$. We abbreviate,
and denote $\mu_r = \mu(C_r^*)$. From this point on, we denote the projection of $A$ onto the
subspace spanned by its top $k$ singular vectors as $\hat A$, and for any vector $v$, we denote $\hat v$
as the projection of $v$ onto this subspace. Throughout the paper, we abuse notation and
use $i$ to iterate over the rows of $A$, whereas $r$ and $s$ are used to iterate over clusters (or
submatrices). So $A_i$ represents the $i$th row of $A$ whereas $A_r$ represents the submatrix
of $A$ whose rows are the points of $C_r^*$.

9.2.2  Basic  Facts. 

The  analysis  of  our  main  theorem  makes  use  of  the  following  facts,  from  [110,  97,  104]. 
We  advise  the  reader  to  go  over  the  proofs,  which  are  short  and  elegant. 

The first fact bounds the cost of assigning the points of $\hat A$ to their original centers.

Fact 9.1 (Lemma 9 from [110]). $\|\hat A - C\|_F^2 \le 8 \min\{k\|A - C\|^2, \|A - C\|_F^2\} = 8 n_r \Delta_r^2$ for every $r$.


Proof.

$$\|\hat A - C\|_F^2 \le 2k\|\hat A - C\|^2 \le 2k\left(\|\hat A - A\| + \|A - C\|\right)^2 \le 2k\left(2\|A - C\|\right)^2$$

where the first inequality holds because $\mathrm{rank}(\hat A - C) \le 2k$, and the last inequality follows
from the fact that $\hat A = \arg\min_{N : \mathrm{rank}(N) = k}\{\|A - N\|\}$. For the same reason,
$\|\hat A - C\|_F \le \|\hat A - A\|_F + \|A - C\|_F \le 2\|A - C\|_F$. □
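The bound of Fact 9.1 is easy to test on synthetic data. The sketch below assumes NumPy; the data generator (well-separated centers with unit Gaussian noise), the sizes and the seed are hypothetical and only serve to produce some matrix $A$ together with a rank-$k$ matrix $C$ of cluster means.

```python
import numpy as np

rng = np.random.default_rng(1)                      # arbitrary seed
k, d, per_cluster = 4, 20, 50                       # hypothetical sizes
centers = 5.0 * rng.standard_normal((k, d))         # hypothetical cluster centers
labels = np.repeat(np.arange(k), per_cluster)
A = centers[labels] + rng.standard_normal((k * per_cluster, d))

# C: row i holds the empirical mean of the cluster that contains point i.
means = np.vstack([A[labels == r].mean(axis=0) for r in range(k)])
C = means[labels]

# A_hat: projection of A onto its top-k singular subspace.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

lhs = np.linalg.norm(A_hat - C, "fro") ** 2
rhs = 8 * min(k * np.linalg.norm(A - C, 2) ** 2,
              np.linalg.norm(A - C, "fro") ** 2)
```

Since the proof only uses that $C$ has rank at most $k$, the inequality `lhs <= rhs` should hold for any such pair, not just for this generator.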

Next, we show that we can match each target center $\mu_r$ to a unique, relatively close,
center $\nu_r$ that we get in Part I of the algorithm (see Figure 9.1).

Fact 9.2 (Claim 1 in Section 3.2 of [97]). For every $\mu_r$ there exists a center $\nu_s$ s.t.
$\|\mu_r - \nu_s\| \le 6\Delta_r$, so we can match each $\mu_r$ to a unique $\nu_r$.


Proof. Observe that by taking $\hat A - C$, we project $A - C$ to a $k$-dimensional subspace, so
we have that $\|\hat A - C\|_F^2 \le k\|\hat A - C\|^2 \le k\|A - C\|^2$. Similarly, $\|\hat A - C\|_F^2 \le \|A - C\|_F^2$.

Assume for the sake of contradiction that $\exists r$ s.t. $\|\mu_r - \nu_s\| > 6\Delta_r$ for all $s$. Since
$\|\hat A - C\|_F^2 \le n_r\Delta_r^2$, our 10-approximation algorithm yields a clustering of cost
$\le 10 n_r \Delta_r^2$. In contrast, as each $\hat A_i$ is assigned to some $\nu_{c(i)}$, the contribution of only the
points in $C_r^*$ to the $k$-means cost of the clustering is more than

$$\sum_{i \in C_r^*} \left\|(\mu_r - \nu_{c(i)}) - (\mu_r - \hat A_i)\right\|^2 > \frac{n_r}{2}(6\Delta_r)^2 - \sum_{i \in C_r^*} \|\hat A_i - \mu_r\|^2 \ge 18 n_r \Delta_r^2 - \|\hat A - C\|_F^2 \ge 10 n_r \Delta_r^2$$

where the first inequality follows from the fact that $(a - b)^2 \ge \frac{1}{2}a^2 - b^2$. □


Finally,  we  exhibit  the  following  fact,  which  is  detailed  in  the  analysis  of  [104]. 

Fact  9.3.  Fix  a  target  cluster  C*  and  let  Sr  be  a  set  of  points  created  by  removing  poutnr 
points  from  C*  and  adding  pin  ( s )  nr  points  from  each  cluster  s  f  r,  s.t.  every  added  point 

x  satisfies  ||x  —  /xs||  >  |||x  —  pr\\.  Assume  pout  <  \  and  pin  =f  Pin(s)  <  Then

In order to prove Fact 9.3 we use the following Fact.

Fact 9.4 (Lemma 5.2 and Corollary 5.3 from [104]). Fix any cluster $C_r^*$ and a subset
$X \subset C_r^*$. Then

$$|X| \, \|\mu(X) - \mu_r\| = \left(|C_r^*| - |X|\right) \|\mu(C_r^* \setminus X) - \mu_r\| \le \sqrt{|X|} \, \|A_r - C_r\|$$


Proof. Let $u_X$ be the indicator vector of $X$. Then

$$\left\| |X| \left(\mu(X) - \mu_r\right) \right\| = \|(A_r - C_r)^T u_X\| \le \|(A_r - C_r)^T\| \, \|u_X\| = \|A_r - C_r\| \sqrt{|X|}$$

and the fact that $|X| \, \|\mu(X) - \mu_r\| = |C_r^* \setminus X| \, \|\mu(C_r^* \setminus X) - \mu_r\|$ is simply because

$$\mu_r = \frac{|X|}{|C_r^*|}\mu(X) + \frac{|C_r^* \setminus X|}{|C_r^*|}\mu(C_r^* \setminus X). \qquad □$$
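Both parts of Fact 9.4 can be observed on a single synthetic cluster. This is a small sketch assuming NumPy; the cluster, the subset size 37 and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)                 # arbitrary seed
n_r, d = 200, 10                               # hypothetical cluster size and dimension
A_r = rng.standard_normal((n_r, d)) + 3.0      # rows: the points of one cluster
mu_r = A_r.mean(axis=0)                        # the cluster mean
C_r = np.tile(mu_r, (n_r, 1))                  # every row equals the mean
spec = np.linalg.norm(A_r - C_r, 2)            # spectral norm ||A_r - C_r||

X = rng.choice(n_r, size=37, replace=False)    # an arbitrary subset of the cluster
rest = np.setdiff1d(np.arange(n_r), X)         # its complement within the cluster

lhs_X = len(X) * np.linalg.norm(A_r[X].mean(axis=0) - mu_r)
lhs_rest = len(rest) * np.linalg.norm(A_r[rest].mean(axis=0) - mu_r)
bound = np.sqrt(len(X)) * spec
```

Here `lhs_X` and `lhs_rest` agree up to rounding, which is the equality in the statement, and both are at most `bound`.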

Proof  of  Fact  9.3.  We  break  ||/j(,S'r)  —  pr\\  into  its  components  and  deduce 

\\ii(sr)  -pr\\  <  (1  ~  pout)Ur  \\fi(sr  n  c*)  -  HrW  +  Y'  P^Ur  \^Sr  n  c *)  -  pr\\ 

nr  z '  « 


s^=r 


nr 


<  (1  Po",K  WA  n  c;)  -  ft ||  + 1 V 

nr 

s^r 


Pinif'fl'r 


nr 


Msrnc*s)-ps 


Plugging  in  Fact  9.4  we  have  || p(Sr)  -  pr\\  <  A.  (y/wnr  +  §  y/piJftyw)  \\A  - 
C || .  The  last  inequality  comes  from  maximizing  the  sum  of  square-roots  by  taking  each 

Pin('S)  Pin/k-  AH 


9.2.3  Formal  Description  of  the  Algorithm  and  Our  Theorems 

Having established notation, we now present our algorithm, in Figure 9.1. Our algorithm’s
goal is threefold: (a) to find a partition that agrees with the target clustering on the
majority of the points, (b) to have the $k$-means cost of this partition comparable with the
target, and (c) to output $k$ centers which are close to the true centers. It is partitioned into 3
parts. Each part requires stronger assumptions, allowing us to prove stronger guarantees.

• Assuming only the center separation of (9.1), Part I gives a clustering which
(a) is correct on at least a $1 - O(c^{-2})$ fraction of the points from each target cluster
(Theorem 9.5), and (b) has $k$-means cost smaller than $(1 + O(1/c))\|A - C\|_F^2$
(Theorem 9.6).

Part I: Find initial centers:

• Project $A$ onto the subspace spanned by the top $k$ singular vectors.

• Run a 10-approximation algorithm^a for the $k$-means problem on the projected
matrix $\hat A$, and obtain $k$ centers $\nu_1, \nu_2, \ldots, \nu_k$.

Part II: Set $S_r \leftarrow \{i : \|\hat A_i - \nu_r\| \le \frac{1}{3}\|\hat A_i - \nu_s\|, \text{ for every } s \ne r\}$ and $\theta_r \leftarrow \mu(S_r)$.

Part III: Repeatedly run Lloyd steps until convergence.

• Set $\Theta_r \leftarrow \{i : \|A_i - \theta_r\| \le \|A_i - \theta_s\|, \text{ for every } s\}$.

• Set $\theta_r \leftarrow \mu(\Theta_r)$.

^a Throughout the paper, we assume the use of a 10-approximation algorithm. Clearly, it is
possible to use any $t$-approximation algorithm, assuming $c/t$ is a large enough constant.

Figure 9.1: Algorithm ^Cluster

• Assuming also that $\Delta_r = \frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$, i.e. assuming the non-degenerate case where
$\|A - C\|_F^2 \ge k\|A - C\|^2$, Part II finds centers that are $O(1/c)\frac{\|A - C\|}{\sqrt{n_r}}$ close to
the true centers (Theorem 9.7). As a result (see Section 9.4.1), if $(1 - \epsilon)n$ points
satisfy the proximity condition (weakened by a $k$ factor), then we misclassify no
more than $(\epsilon + O(c^{-4}))n$ points.

• Assuming all points satisfy the proximity condition (weakened by a $k$-factor), Part
III finds exactly the target partition (Theorem 9.14).
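The three parts of Figure 9.1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 10-approximation step of Part I is replaced here by farthest-point seeding followed by Lloyd iterations (a stand-in with no proven approximation factor), and the names `lloyd` and `cluster_sketch` are hypothetical.

```python
import numpy as np

def lloyd(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given centers; returns (centers, labels)."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for r in range(len(centers)):
            if np.any(labels == r):          # keep the old center of an empty cluster
                new_centers[r] = points[labels == r].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

def cluster_sketch(A, k, seed=0):
    rng = np.random.default_rng(seed)
    # Part I: project A onto its top-k singular subspace.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]
    # Stand-in for the 10-approximation (assumption): farthest-point seeding + Lloyd.
    idx = [int(rng.integers(len(A)))]
    for _ in range(k - 1):
        d = np.linalg.norm(A_hat[:, None, :] - A_hat[idx][None, :, :], axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    nu, _ = lloyd(A_hat, A_hat[idx])
    # Part II: S_r holds the points at least 3 times closer to nu_r than to any other nu_s.
    dist = np.linalg.norm(A_hat[:, None, :] - nu[None, :, :], axis=2)
    theta = nu.copy()
    for r in range(k):
        nearest_other = np.delete(dist, r, axis=1).min(axis=1)
        in_S_r = dist[:, r] <= nearest_other / 3.0
        if in_S_r.any():
            theta[r] = A[in_S_r].mean(axis=0)   # theta_r = mu(S_r), on the original points
    # Part III: Lloyd steps on the original (unprojected) points.
    return lloyd(A, theta)
```

A call such as `centers, labels = cluster_sketch(A, k)` returns the final centers and the index of the nearest center for each row of `A`. On inputs that do not satisfy the separation assumption the stand-in seeding may behave poorly, so the guarantees discussed in this chapter should not be read into this sketch.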


9.2.4  Proofs  Overview 

Proof outline for Section 9.3. The first part of our analysis is an immediate application
of Facts 9.1 and 9.2. Our assumption dictates that the distance between any two centers
is big ($\ge c(\Delta_r + \Delta_s)$). Part I of the algorithm assigns each projected point $\hat A_i$ to the
nearest $\nu_r$ instead of the true center $\mu_r$, and Fact 9.2 assures that the distance $\|\mu_r - \nu_r\|$
is small ($\le 6\Delta_r$). Consider a misclassified point $A_i$, where $\|A_i - \mu_r\| < \|A_i - \mu_s\|$ yet
$\|\hat A_i - \nu_s\| < \|\hat A_i - \nu_r\|$. The triangle inequality assures that $\hat A_i$ has a fairly big distance to its
true center ($> (\frac{c}{2} - 12)\Delta_r$). We deduce that each misclassified point contributes $\Omega(c^2\Delta_r^2)$
to the $k$-means cost of assigning all projected points to their true centers. Fact 9.1 bounds
this cost by $\|\hat A - C\|_F^2 \le 8 n_r \Delta_r^2$, so the Markov inequality proves only a few points
are misclassified. Additional application of the triangle inequality for misclassified points
gives that the distance between the original point $A_i$ and a true center $\mu_r$ is comparable to
the distance $\|A_i - \mu_s\|$, and so assigning $A_i$ to the cluster $s$ only increases the $k$-means
cost by a small factor.


Proof outline for Section 9.4. In the second part of our analysis we compare
the true clustering $C^*$ with some proposed clustering $S$, looking both at the number of
misclassified points and at the distances between the matching centers $\|\mu_r - \theta_r\|$. As
Kumar and Kannan show, the two measurements are related: Fact 9.3 shows how the
distances between the means depend on the number of misclassified points, and the main
lemma (Lemma 9.11) essentially shows the opposite direction. These two relations are
how Kumar and Kannan show that Lloyd steps converge to good centers, yielding clusters
with few misclassified points. They repeatedly apply (their version of) the main lemma,
showing that with each step the distances to the true means decrease and so fewer of the
good points are misclassified.

To improve on Kumar and Kannan's analysis, we improve on the two above-mentioned
relations. Lemma 9.11 is a simplification of a lemma from Kumar and Kannan, where
instead of projecting into a $k$-dimensional space, we project only into a 4-dimensional
space, thus reducing dependency on $k$. However, the dependency of Fact 9.3 on $k$ is tight³.
So in Part II of the algorithm we devise sub-clusters $S_r$ s.t. $\rho_{in}(s) = \rho_{out}/k^2$. The crux
in devising $S_r$ lies in Proposition 9.10 - we show that any misclassified projected point
$i \in C_s^* \cap S_r$ is essentially misclassified by $\hat\mu_r$. And since (see [4]) $\|\mu_r - \hat\mu_r\| \le \frac{1}{\sqrt{k}}\Delta_r$
(compared to the bound $\|\mu_r - \nu_r\| \le 6\Delta_r$), we are able to give a good bound on $\rho_{in}(s)$.

Recall that we rely only on center separation rather than a large batch of points satisfying
the Kumar-Kannan separation, and so we do not apply iterative Lloyd steps (unless all
points are good). Instead, we apply the main lemma only once, w.r.t. the misclassified
points in $C_s^* \cap S_r$, and deduce that the distances $\|\mu_r - \theta_r\|$ are small. In other words, Part
II is a single step that retrieves centers whose distances to the original centers are $\sqrt{k}$ times
better than the centers retrieved by Kumar and Kannan in numerous Lloyd iterations.


³ In fact, Fact 9.3 is exactly why the case of $k = \omega(1)$ is hard - because the $L_1$ and $L_2$ norms of the
vector $(\frac{1}{\sqrt{k}}, \ldots, \frac{1}{\sqrt{k}})$ are not comparable for non-constant $k$.


9.3 Part I of the Algorithm


In this section, we look only at Part I of our algorithm. Our approximation algorithm
defines a clustering $T$, where $T_r = \{i : \|\hat A_i - \nu_r\| \le \|\hat A_i - \nu_s\| \text{ for every } s\}$. Our goal in
this section is to show that $T$ is correct on all but a small constant fraction of the points,
and furthermore, the $k$-means cost of $T$ is no more than $(1 + O(1/c))$ times the $k$-means
cost of the target clustering.

Theorem 9.5. There exists a matching (given by Fact 9.2) between the target clustering
$C^*$ and the clustering $T = \{T_r\}_r$ where $T_r = \{i : \|\hat A_i - \nu_r\| \le \|\hat A_i - \nu_s\| \text{ for every } s\}$
that satisfies the following properties:

• For every cluster $C_{s_0}^*$ in the target clustering, no more than $O(1/c^2)|C_{s_0}^*|$ points are
misclassified.

• For every cluster $T_{r_0}$ in the clustering that the algorithm outputs, we add no more
than $O(1/c^2)|C_{r_0}^*|$ points from other clusters.

• At most $O(1/c^2)|C_2^*|$ points are misclassified overall, where $C_2^*$ is the second largest
cluster.

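Proof. Let us denote $T_{s \to r}$ as the set of points $\hat A_i$ that are assigned to $C_s^*$ in the target
clustering, yet are closer to $\nu_r$ than to any other $\nu_{r'}$. From the triangle inequality we have that
$\|\hat A_i - \mu_s\| \ge \|\hat A_i - \nu_s\| - \|\mu_s - \nu_s\|$. We know from Fact 9.2 that $\|\mu_s - \nu_s\| \le 6\Delta_s$. Also,
since $\hat A_i$ is closer to $\nu_r$ than to $\nu_s$, the triangle inequality gives that $2\|\hat A_i - \nu_s\| \ge \|\nu_r - \nu_s\|$.
So,

$$\|\hat A_i - \mu_s\| \ge \frac{1}{2}\|\nu_r - \nu_s\| - 6\Delta_s \ge \frac{1}{2}\|\mu_r - \mu_s\| - 12(\Delta_r + \Delta_s) \ge \frac{c}{4}(\Delta_r + \Delta_s)$$

Thus, we can look at $\|\hat A - C\|_F^2$ and using Fact 9.1 we immediately have that for every
fixed $r'$

$$\sum_r \sum_{s \ne r} |T_{s \to r}| \frac{c^2}{16}(\Delta_r + \Delta_s)^2 \le \sum_r \sum_{i \in C_r^*} \|\hat A_i - \mu_r\|^2 = \|\hat A - C\|_F^2 \le 8 n_{r'} \Delta_{r'}^2$$

The proof of the theorem follows from fixing some $r_0$ or some $s_0$ and deducing:

$$\Delta_{s_0}^2 \sum_{r \ne s_0} |T_{s_0 \to r}| \le \sum_{r \ne s_0} |T_{s_0 \to r}|(\Delta_r + \Delta_{s_0})^2 \le \sum_r \sum_{s \ne r} |T_{s \to r}|(\Delta_r + \Delta_s)^2 \le \frac{128}{c^2} n_{s_0} \Delta_{s_0}^2$$

$$\Delta_{r_0}^2 \sum_{s \ne r_0} |T_{s \to r_0}| \le \sum_{s \ne r_0} |T_{s \to r_0}|(\Delta_{r_0} + \Delta_s)^2 \le \sum_r \sum_{s \ne r} |T_{s \to r}|(\Delta_r + \Delta_s)^2 \le \frac{128}{c^2} n_{r_0} \Delta_{r_0}^2$$

Observe that for every $r \ne s$ we have that $\Delta_r + \Delta_s \ge \Delta_2$ (where $C_2^*$ is the cluster with
the second largest number of points), so we have that

$$\Delta_2^2 \sum_r \sum_{s \ne r} |T_{s \to r}| \le \sum_r \sum_{s \ne r} |T_{s \to r}|(\Delta_r + \Delta_s)^2 \le \frac{128}{c^2} n_2 \Delta_2^2 \qquad □$$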


We now show that the $k$-means cost of $T$ is close to the $k$-means cost of $C^*$. Observe
that the $k$-means cost of $T$ is computed w.r.t. the best center of each cluster (i.e., $\mu(T_r)$),
and not w.r.t. the centers $\nu_r$.
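The distinction between the cost w.r.t. the best centers and the cost w.r.t. some other fixed centers can be made concrete with a small helper. This is a plain-Python sketch; the function name `kmeans_cost` and the toy points are illustrative, and the code assumes every label in `0..k-1` is used by at least one point.

```python
def kmeans_cost(points, labels, centers=None):
    """Sum of squared distances of the points to the center of their cluster.

    If `centers` is None, each cluster's mean (its best center) is used.
    """
    k = max(labels) + 1
    dim = len(points[0])
    if centers is None:
        centers = []
        for r in range(k):
            members = [p for p, l in zip(points, labels) if l == r]
            centers.append([sum(p[j] for p in members) / len(members)
                            for j in range(dim)])
    return sum(sum((pj - cj) ** 2 for pj, cj in zip(p, centers[l]))
               for p, l in zip(points, labels))

# Toy example: two clusters on a line.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
labels = [0, 0, 1, 1]
cost_best = kmeans_cost(points, labels)                              # means as centers
cost_other = kmeans_cost(points, labels, [[0.0, 0.0], [10.0, 0.0]])  # fixed centers
```

Replacing the cluster means by any other centers can only increase the cost, which is the direction used in the proof of Theorem 9.6.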

Theorem 9.6. The $k$-means cost of $T$ is at most $(1 + O(1/c))\|A - C\|_F^2$.

Proof. Given $T$, it is clear that the centers that minimize its $k$-means cost are
$\mu(T_r) = \frac{1}{|T_r|}\sum_{i \in T_r} A_i$. Recall that the majority of points in each $T_r$ belong to a unique $C_r^*$, and so,
throughout this section, we assume that all points in $T_r$ were assigned to $\mu_r$, and not to
$\mu(T_r)$. (Clearly, this can only increase the cost.) We show that by assigning the points of
$T_r$ to $\mu_r$, our cost is at most $(1 + O(1/c))\|A - C\|_F^2$, and so Theorem 9.6 follows. In fact,
we show something stronger. We show that by assigning all the points in $T_r$ to $\mu_r$, each
point $A_i$ pays no more than $(1 + O(1/c))\|A_i - C_i\|^2$. This is clearly true for all the points
in $T_r \cap C_r^*$. We show this also holds for the misclassified points.

Because $i \in T_{s \to r}$, it holds that $\|\hat A_i - \nu_r\| \le \|\hat A_i - \nu_s\|$. Observe that for every $s$ we
have that $\|A_i - \nu_s\|^2 = \|A_i - \hat A_i\|^2 + \|\hat A_i - \nu_s\|^2$, because $\hat A_i - \nu_s$ is the projection of
$A_i - \nu_s$ onto the subspace spanned by the top $k$ singular vectors of $A$. Therefore, it is also
true that $\|A_i - \nu_r\| \le \|A_i - \nu_s\|$. Because of Fact 9.2, we have that $\|\mu_r - \nu_r\| \le 6\Delta_r$ and
$\|\mu_s - \nu_s\| \le 6\Delta_s$, so we apply the triangle inequality and get

$$\|A_i - \mu_r\| \le \|A_i - \mu_s\| + \|\mu_r - \nu_r\| + \|\mu_s - \nu_s\| \le \|A_i - \mu_s\|\left(1 + \frac{6(\Delta_r + \Delta_s)}{\|A_i - \mu_s\|}\right)$$

So all we need to do is to lower bound $\|A_i - \mu_s\|$. As noted, $\|A_i - \nu_s\| \ge \|\hat A_i - \nu_s\|$. Thus

$$\|A_i - \mu_s\| \ge \|A_i - \nu_s\| - 6\Delta_s \ge \|\hat A_i - \nu_s\| - 6\Delta_s \ge \frac{1}{2}\|\nu_s - \nu_r\| - 6\Delta_s \ge \frac{1}{4}c(\Delta_r + \Delta_s)$$

and we have the bound $\|A_i - \mu_r\| \le \left(1 + \frac{24}{c}\right)\|A_i - \mu_s\|$, so
$\|A_i - \mu_r\|^2 \le \left(1 + \frac{24}{c}\right)^2\|A_i - \mu_s\|^2$. □


9.3.1 Application: The ORSS-Separation


One straightforward application of Theorem 9.6 is for the datasets considered by Ostrovsky
et al [121], where the optimal $k$-means cost is an $\epsilon$-fraction of the optimal $(k-1)$-means
cost. Ostrovsky et al proved that for such datasets a variant of the Lloyd method
converges to a good solution in polynomial time. Kumar and Kannan have shown that
datasets satisfying the ORSS-separation also have the property that most points satisfy
their proximity-condition. Their analysis is not immediate, and gives a $(1 + O(\sqrt{k\epsilon}))$-approximation.
Here, we provide a “one-line” proof that Part I of Algorithm ^Cluster
yields a $(1 + O(\sqrt{\epsilon}))$-approximation, for any $k$.

Suppose we have a dataset satisfying the ORSS-separation condition, so any $(k-1)$-partition
of the dataset has cost $\ge \frac{1}{\epsilon}\|A - C\|_F^2$. For any $r$ and any $s \ne r$, by assigning
all the points in $C_r^*$ to the center $\mu_s$, we get some $(k-1)$-partition whose cost is exactly
$\|A - C\|_F^2 + n_r\|\mu_r - \mu_s\|^2$, so $\|\mu_r - \mu_s\| \ge \sqrt{\frac{1/\epsilon - 1}{n_r}}\,\|A - C\|_F$. Setting $c = O(1/\sqrt{\epsilon})$,
Theorem 9.6 is immediate.


9.4  Part  II  of  the  Algorithm 

In this section, our goal is to show that Part II of our algorithm gives centers that are very
close to the target centers. We should note that from this point on, we assume we are in
the non-degenerate case, where $\|A - C\|_F^2 \ge k\|A - C\|^2$. Therefore, $\Delta_r = \frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$.

Recall, in Part II we define the sets $S_r = \{i : \|\hat A_i - \nu_r\| \le \frac{1}{3}\|\hat A_i - \nu_s\|, \forall s \ne r\}$.
Observe, these sets do not define a partition of the dataset! There are some points that are
not assigned to any $S_r$. However, we only use the centers of $S_r$. We prove the following
theorem.

Theorem 9.7. Denote $S_r = \{i : \|\hat A_i - \nu_r\| \le \frac{1}{3}\|\hat A_i - \nu_s\|, \forall s \ne r\}$. Then for every $r$ it
holds that $\|\mu(S_r) - \mu_r\| = O(1/c)\frac{1}{\sqrt{n_r}}\|A - C\| = O\left(\frac{1}{c\sqrt{k}}\Delta_r\right)$.

The proof of Theorem 9.7 is an immediate application of Fact 9.3 combined with the
following two lemmas, that bound the number of misclassified points. Observe that
every point that belongs to $C_s^*$ yet is assigned to $S_r$ (for $s \ne r$) is also assigned to $T_r$
in the clustering $T$ discussed in the previous section. Therefore, any misclassified point
$i \in C_s^* \cap S_r$ satisfies $\|A_i - \mu_r\| \le (1 + O(c^{-1}))\|A_i - \mu_s\|$, as the proof of Theorem 9.6
shows. So all conditions of Fact 9.3 hold.

Lemma 9.8. Assume that for every $r$ we have that $\|\mu_r - \nu_r\| \le 6\Delta_r$. Then at most $\frac{512}{c^2} n_r$
points of $C_r^*$ do not belong to $S_r$.

Lemma 9.9. Redefine $T_{s \to r}$ as the set $C_s^* \cap S_r$. Assume that for every $r$ we have that
$\|\mu_r - \nu_r\| \le 6\Delta_r$. Then for every $r$ and every $s \ne r$ we have that $|T_{s \to r}| = O\left(\frac{1}{c^4 k^2}\right) n_r$.

Proof of Lemma 9.8. First, we claim that if $i$ is such that $\|\hat A_i - \mu_r\| \le \frac{c}{8}\Delta_r$, then it must
be the case that $i \in S_r$.

This is a simple consequence of the triangle inequality, bounding
$\|\hat A_i - \nu_r\| \le \|\hat A_i - \mu_r\| + \|\mu_r - \nu_r\| \le ((c/8) + 6)\Delta_r$. Yet, for every $s \ne r$, the triangle inequality gives that
$\|\hat A_i - \nu_s\| \ge \|\mu_r - \mu_s\| - \|\hat A_i - \mu_r\| - \|\mu_s - \nu_s\| \ge (c - \frac{c}{8} - 6)(\Delta_r + \Delta_s)$. Assuming
$c > 48$, we have that $\|\hat A_i - \nu_s\| \ge 3\|\hat A_i - \nu_r\|$.

All that’s left is to show that the number of $i \in C_r^*$ s.t. $\|\hat A_i - \mu_r\| > \frac{c}{8}\Delta_r$ is small.
This again follows from the Markov inequality: Since $\|\hat A - C\|_F^2 \le 8k\|A - C\|^2$, the
number of such points is at most $\frac{512}{c^2} n_r$. □
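The Markov step at the end of this proof can be spelled out as a short worked computation. It assumes the non-degenerate case $\Delta_r = \frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$ used throughout this section; the symbol $B_r$ is introduced here only for illustration and denotes the set of points of $C_r^*$ whose projection is far from $\mu_r$.

```latex
\begin{align*}
B_r &= \left\{ i \in C_r^* : \|\hat A_i - \mu_r\| > \tfrac{c}{8}\Delta_r \right\} \\
|B_r| \cdot \frac{c^2}{64}\Delta_r^2
  &\le \sum_{i \in C_r^*} \|\hat A_i - \mu_r\|^2
   \le \|\hat A - C\|_F^2
   \le 8k\|A - C\|^2
   = 8 n_r \Delta_r^2 \\
\Longrightarrow \quad |B_r| &\le \frac{8 \cdot 64}{c^2}\, n_r = \frac{512}{c^2}\, n_r
\end{align*}
```

For instance, under these assumptions a separation constant of $c = 100$ would leave at most about $5\%$ of each target cluster outside $S_r$.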

We now turn to proving Lemma 9.9. The general outline of the proof of Lemma 9.9
resembles the outline of the proof of Lemma 9.8. Proposition 9.10 exhibits some property
that every point in $T_{s \to r}$ must satisfy, and then we show that only few of the points in
$C_s^*$ satisfy this property. Recall that $\hat\mu_r$ indicates the projection of $\mu_r$ onto the subspace
spanned by the top $k$ singular vectors of $A$.

Proposition 9.10. Fix $i \in C_s^*$ s.t. $\|\hat A_i - \hat\mu_s\| \le 2\|\hat A_i - \hat\mu_r\|$. Then $\|\hat A_i - \nu_s\| \le 3\|\hat A_i - \nu_r\|$,
so $i \notin S_r$.

Proof. First, for every $r$ we have that $\|\hat\mu_r - \nu_r\| \le \|\mu_r - \nu_r\| \le 6\Delta_r$, as $\hat\mu_r - \nu_r$ is a
projection of $\mu_r - \nu_r$.

Let us fiddle with the triangle inequality, in order to obtain a lower bound on $\|\hat A_i - \nu_r\|$.
We have that $3\|\hat A_i - \hat\mu_r\| \ge \|\hat\mu_r - \hat\mu_s\| \ge \|\mu_r - \mu_s\| - (\|\mu_r - \nu_r\| + \|\nu_r - \hat\mu_r\|) - (\|\mu_s - \nu_s\| + \|\nu_s - \hat\mu_s\|) \ge (c - 12)(\Delta_r + \Delta_s)$,
thus $\|\hat A_i - \nu_r\| \ge \left(\frac{c - 12}{3} - 6\right)(\Delta_r + \Delta_s)$.

Assume for the sake of contradiction that $\|\hat A_i - \nu_s\| > 3\|\hat A_i - \nu_r\|$, and let us show
this yields an upper bound on $\|\hat A_i - \nu_r\|$, which contradicts our lower bound. We have that

$$6\Delta_s \ge \|\hat A_i - \nu_s\| - \|\hat A_i - \hat\mu_s\| > 3\|\hat A_i - \nu_r\| - 2\|\hat A_i - \hat\mu_r\| \ge \|\hat A_i - \nu_r\| - 2 \cdot 6\Delta_r$$

It follows that $12(\Delta_r + \Delta_s) > \|\hat A_i - \nu_r\| \ge \left(\frac{c - 12}{3} - 6\right)(\Delta_r + \Delta_s)$. Contradiction ($c > 60$). □

Proposition 9.10 shows that in order to bound $|T_{s \to r}|$ it suffices to bound the number
of points in $C_s^*$ satisfying $\|\hat A_i - \hat\mu_s\| \ge 2\|\hat A_i - \hat\mu_r\|$. The major tool in providing this
bound is the following technical lemma. This lemma is a variation on the work of [104],
in which we improve the dependency on $k$ and simplify the proof.

Lemma 9.11 (Main Lemma). Fix $\alpha, \beta > 0$. Fix $r \ne s$ and let $\zeta_r$ and $\zeta_s$ be two points s.t.
$\|\mu_r - \zeta_r\| \le \alpha\Delta_r$ and $\|\mu_s - \zeta_s\| \le \alpha\Delta_s$. We denote $\tilde A_i$ as the projection of $A_i$ onto the line
connecting $\zeta_r$ and $\zeta_s$. Define $X = \{i \in C_s^* : \|\tilde A_i - \zeta_s\| - \|\tilde A_i - \zeta_r\| \ge \beta\|\zeta_s - \zeta_r\|\}$.
Then $|X| \le 256\frac{\alpha^2}{\beta^2}\frac{1}{c^4 k}\min\{n_r, n_s\}$.

Proof. Let $V$ be the subspace spanned by the following 4 vectors: $\{\mu_r, \mu_s, \zeta_r, \zeta_s\}$. Denote
$P_V$ as the projection onto $V$. We denote $v_i = P_V(A_i)$, and observe that $P_V(\mu_r) = \mu_r$, and
the same goes for $\mu_s$, $\zeta_r$ and $\zeta_s$. Observe also that, as a projection, $\|P_V(A - C)\| \le \|A - C\|$
(alternatively, $\|P_V\| = 1$).

We now make a simple observation. Let $\bar A_i$ denote the projection of $A_i$ onto the line
connecting $\mu_r$ and $\mu_s$. Now, the inequality $\|A_i - \mu_s\| \le \|A_i - \mu_r\|$ holds iff the inequality
$\|\bar A_i - \mu_s\| \le \|\bar A_i - \mu_r\|$ holds (because $\|A_i - \mu_r\|^2 = \|A_i - \bar A_i\|^2 + \|\bar A_i - \mu_r\|^2$).
Furthermore, such relation holds for any point whose projection on the line connecting $\mu_r$
and $\mu_s$ is identical to $\bar A_i$. In particular, if $W$ is any subspace containing $\mu_r$ and $\mu_s$, then the
projection of $A_i$ onto $W$ is closer to $\mu_r$ than to $\mu_s$ iff $A_i$ is closer to $\mu_r$ than to $\mu_s$. Thus,
since $\|A_i - \mu_s\| \le \|A_i - \mu_r\|$ then $\|v_i - \mu_s\| \le \|v_i - \mu_r\|$. Furthermore, as $\zeta_r$ and $\zeta_s$ also
belong to $V$, the projection of $A_i$ onto the line connecting $\zeta_s$ and $\zeta_r$ is identical to the
projection of $v_i$ onto the same line (meaning, $\tilde A_i = \tilde v_i$). So $v_i$ also satisfies the inequality
$\|\tilde v_i - \zeta_s\| - \|\tilde v_i - \zeta_r\| \ge \beta\|\zeta_s - \zeta_r\|$ and, of course, $\|v_i - \zeta_r\|^2 = \|v_i - \tilde v_i\|^2 + \|\tilde v_i - \zeta_r\|^2$.

The proof follows from upper- and lower-bounding the term $\|v_i - \zeta_s\|^2 - \|v_i - \zeta_r\|^2$.
We’ve just shown a lower bound, as we have that

$$\|v_i - \zeta_s\|^2 - \|v_i - \zeta_r\|^2 = \left(\|\tilde v_i - \zeta_s\| - \|\tilde v_i - \zeta_r\|\right)\left(\|\tilde v_i - \zeta_s\| + \|\tilde v_i - \zeta_r\|\right) \ge \beta\|\zeta_s - \zeta_r\|^2$$

The triangle inequality gives that $\|v_i - \zeta_s\| \le \|v_i - \mu_s\| + \alpha(\Delta_r + \Delta_s)$, and that
$\|v_i - \zeta_r\| \ge \|v_i - \mu_r\| - \alpha(\Delta_r + \Delta_s)$, so we have the upper bound of

$$\|v_i - \zeta_s\|^2 - \|v_i - \zeta_r\|^2 \le \left(\|v_i - \mu_s\| + \alpha(\Delta_r + \Delta_s)\right)^2 - \left(\|v_i - \mu_r\| - \alpha(\Delta_r + \Delta_s)\right)^2$$
$$\le \left(\|v_i - \mu_s\| + \alpha(\Delta_r + \Delta_s)\right)^2 - \left(\|v_i - \mu_s\| - \alpha(\Delta_r + \Delta_s)\right)^2$$
$$= 4\alpha(\Delta_r + \Delta_s)\|v_i - \mu_s\|$$

Comparing the upper and the lower bound, we have that for any $i \in X$ the distance
$\|v_i - \mu_s\| \ge \frac{\beta c^2}{8\alpha}(\Delta_r + \Delta_s)$. As $X \subset C_s^*$, the Markov inequality concludes the proof

$$|X| \left(\frac{\beta c^2 \sqrt{k}\,\|A - C\|}{8\alpha}\right)^2 \frac{1}{\min\{n_r, n_s\}} \le \sum_{i \in X} \|v_i - \mu_s\|^2 \le \|P_V(A - C)\|_F^2 \le 4\|A - C\|^2 \qquad □$$

Proof of Lemma 9.9. Every $i \in T_{s \to r}$ must satisfy that $\|\hat A_i - \hat\mu_s\| \ge 2\|\hat A_i - \hat\mu_r\|$
(Proposition 9.10). Therefore, we must have that $\|\bar A_i - \hat\mu_s\| \ge 2\|\bar A_i - \hat\mu_r\|$, where we denote $\bar A_i$
as the projection of $\hat A_i$ onto the line connecting $\hat\mu_r$ with $\hat\mu_s$ (simply because
$\|\hat A_i - \hat\mu_s\|^2 = \|\hat A_i - \bar A_i\|^2 + \|\bar A_i - \hat\mu_s\|^2$). Therefore, $\|\hat\mu_r - \hat\mu_s\| \le \frac{3}{2}\|\bar A_i - \hat\mu_s\|$, so
$\|\bar A_i - \hat\mu_s\| - \|\bar A_i - \hat\mu_r\| \ge \frac{1}{3}\|\hat\mu_r - \hat\mu_s\|$.

Thus, every $i \in T_{s \to r}$ satisfies the conditions of Lemma 9.11 with $\zeta_r = \hat\mu_r$, $\zeta_s = \hat\mu_s$,
and $\beta = 1/3$. We deduce that $|T_{s \to r}| \le \frac{2304\,\alpha^2}{c^4 k}\min\{n_r, n_s\}$, where $\alpha$ is the bound s.t. for
every $r$, $\|\mu_r - \hat\mu_r\| \le \alpha\frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$. Since $\alpha \le \frac{1}{\sqrt{k}}$, we conclude the proof.

The fact that $\alpha$ is small was proven by Achlioptas and McSherry (Theorem 1 of [4]).
Denote $u_r$ as the indicator vector of $C_r^*$. Since $\mathrm{rank}(C) \le k$, we get

$$\|\mu_r - \hat\mu_r\| = \frac{1}{n_r}\|(A - \hat A)^T u_r\| \le \frac{1}{n_r}\|u_r\|\,\|A - \hat A\| \le \frac{1}{\sqrt{n_r}}\|A - C\| \qquad □$$

As an interesting corollary, Theorem 9.7 dictates that for every $r$ we have that
$\|\mu_r - \theta_r\| = O(1/c)\|\mu_r - \hat\mu_r\|$.

9.4.1  The  Proximity  Condition  -  Part  III  of  the  Algorithm 

Part II of our algorithm returns centers $\theta_1, \ldots, \theta_k$ which are $O\left(\frac{1}{c\sqrt{n_r}}\right)\|A - C\|$ close to the
true centers. Suppose we use these centers to cluster the points:
$\Theta_s = \{i : \forall s', \|A_i - \theta_s\| \le \|A_i - \theta_{s'}\|\}$. It is evident that this clustering correctly classifies the majority of the
points. It correctly classifies any point $i \in C_s^*$ with $\|A_i - \mu_r\| - \|A_i - \mu_s\| = \Omega\left(\frac{1}{c\sqrt{n_r}}\right)\|A - C\|$
for every $r \ne s$, and the analysis of Theorem 9.5 shows that at most an $O(c^{-2})$-fraction
of the points do not satisfy this condition. In order to have a direct comparison with
the Kumar-Kannan analysis, we now bound the number of misclassified points w.r.t. the
fraction of points satisfying the Kumar-Kannan proximity condition.

Definition 9.12. Denote $\mathrm{gap}_{r,s} = \left(\frac{1}{\sqrt{n_r}} + \frac{1}{\sqrt{n_s}}\right)\|A - C\|$. Call a point $i \in C_s^*$ $\gamma$-good, if for
every $r \ne s$ we have that the projection of $A_i$ onto the line connecting $\mu_r$ and $\mu_s$, denoted
$\bar A_i$, satisfies that $\|\bar A_i - \mu_r\| - \|\bar A_i - \mu_s\| \ge \gamma\,\mathrm{gap}_{r,s}$; otherwise we say the point is $\gamma$-bad.

Corollary 9.13. If the number of $\gamma$-bad points is $\epsilon n$, then (a) the clustering $\{\Theta_1, \ldots, \Theta_k\}$
misclassifies no more than $\left(\epsilon + \frac{O(1)}{\gamma^2 c^4}\right) n$ points, and (b) $\epsilon \le O\left(\left(c - \frac{\gamma}{\sqrt{k}}\right)^{-2}\right)$, assuming
$\gamma < c\sqrt{k}$.


Proof. Clearly, all $\epsilon n$ bad points may be misclassified. In addition, for every $r$ and $s \ne r$,
Lemma 9.11 (setting $\zeta_r = \theta_r$, $\zeta_s = \theta_s$, $\alpha = 1/(c\sqrt{k})$ and $\beta = \Omega(\gamma/(c\sqrt{k}))$) proves that no
more than $O(\gamma^{-2}c^{-4}k^{-1}) n_s$ good points can be misclassified. Since $\sum_{s \ne r} n_s \le n$ for each
of the $k$ values of $r$, we conclude (a).

The proof of (b) is similar to the proof of Theorem 9.5. We look at the $k$-means cost
of $\|\hat A - C\|_F^2$. We show that all $\gamma$-bad points contribute a large amount to this cost.

Take $A_i$ to be a $\gamma$-bad point from $C_s^*$. Projecting it down to the line connecting $\mu_r$ and
$\mu_s$, we denote the projection as $\bar A_i$. Clearly, $\|\mu_r - \mu_s\| = \|\mu_r - \bar A_i\| + \|\bar A_i - \mu_s\| \ge c\sqrt{k}\,\mathrm{gap}_{r,s}$
whereas $\|\mu_r - \bar A_i\| - \|\bar A_i - \mu_s\| \le \gamma\,\mathrm{gap}_{r,s}$. It follows that
$\|\hat A_i - \mu_s\| \ge \|\bar A_i - \mu_s\| \ge \frac{1}{2}(c\sqrt{k} - \gamma)\,\mathrm{gap}_{r,s} \ge \frac{c\sqrt{k} - \gamma}{2\sqrt{n_s}}\|A - C\|$. Again, the Markov inequality gives
that

$$\#\{\text{bad points from } C_s^*\} \cdot \frac{(c\sqrt{k} - \gamma)^2}{4 n_s}\|A - C\|^2 \le \|\hat A - C\|_F^2 \le 8k\|A - C\|^2$$

so from each cluster, only a fraction of $32\left(\frac{\sqrt{k}}{c\sqrt{k} - \gamma}\right)^2$ of the points can be bad. □


Observe that Corollary 9.13 allows for multiple scaled versions of the proximity
condition, based on the magnitude of $\gamma$. In particular, setting $\gamma = 1$ we get a proximity
condition whose bound is independent of $k$, and still our clustering misclassifies only a
small fraction of the points - at most an $O(c^{-2})$ fraction of all points might be misclassified
because they are 1-bad, and no more than an $O(c^{-4})$-fraction of 1-good points may be
misclassified. In addition, if there are no 1-bad points we show the following theorem. The
proof (omitted) merely follows the Kumar-Kannan proof, plugging in the better bounds
provided by Lemma 9.11.
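The scaled proximity condition can be checked point by point. The sketch below is plain Python and follows the definition of a $\gamma$-good point: project the point onto the line through $\mu_r$ and its own center, and compare the difference of distances with $\gamma \cdot \mathrm{gap}_{r,s}$. The function names are hypothetical, and the spectral norm $\|A - C\|$ is assumed to have been computed elsewhere and is passed in as a number.

```python
import math

def project_onto_line(x, p, q):
    """Orthogonal projection of the point x onto the line through p and q."""
    direction = [qj - pj for pj, qj in zip(p, q)]
    scale = sum((xj - pj) * dj for xj, pj, dj in zip(x, p, direction))
    scale /= sum(dj * dj for dj in direction)
    return [pj + scale * dj for pj, dj in zip(p, direction)]

def is_gamma_good(x, own, centers, sizes, spectral_norm, gamma=1.0):
    """Check the gamma-good condition for a point x of target cluster `own`.

    centers[r] plays the role of mu_r, sizes[r] of n_r, and spectral_norm
    stands for ||A - C|| (assumed to be supplied by the caller).
    """
    for r in range(len(centers)):
        if r == own:
            continue
        gap = (1 / math.sqrt(sizes[r]) + 1 / math.sqrt(sizes[own])) * spectral_norm
        proj = project_onto_line(x, centers[r], centers[own])
        margin = math.dist(proj, centers[r]) - math.dist(proj, centers[own])
        if margin < gamma * gap:
            return False
    return True

# Toy example: two centers on a line, equal cluster sizes, ||A - C|| taken to be 5.
centers = [[0.0, 0.0], [10.0, 0.0]]
sizes = [100, 100]
near_own = is_gamma_good([9.0, 3.0], 1, centers, sizes, 5.0)
near_middle = is_gamma_good([5.2, 7.0], 1, centers, sizes, 5.0)
near_middle_relaxed = is_gamma_good([5.2, 7.0], 1, centers, sizes, 5.0, gamma=0.3)
```

In this toy setting the gap equals 1, so a point whose projection lies close to the midpoint of the two centers fails the check for $\gamma = 1$ but passes it for a smaller $\gamma$.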

Theorem 9.14. Assume all data points are 1-good. That is, for every point $A_i$ that belongs
to the target cluster $C_{c(i)}^*$ and every $s \ne c(i)$, by projecting $A_i$ onto the line connecting
$\mu_{c(i)}$ with $\mu_s$ we have that the projected point $\bar A_i$ satisfies
$\|\bar A_i - \mu_s\| - \|\bar A_i - \mu_{c(i)}\| = \Omega\left(\frac{1}{\sqrt{n_{c(i)}}} + \frac{1}{\sqrt{n_s}}\right)\|A - C\|$, whereas
$\|\mu_{c(i)} - \mu_s\| = \Omega\left(\sqrt{k}\left(\frac{1}{\sqrt{n_{c(i)}}} + \frac{1}{\sqrt{n_s}}\right)\right)\|A - C\|$.
Then the Lloyd method, starting with $\theta_1, \ldots, \theta_k$, converges to the true centers.


9.5 Applications


Clustering a mixture of Gaussians. For a mixture of $k$ Gaussians, we quote the suitable
results without proof, as the proof is identical to the proof in [104]. We are given a mixture
of $k$ Gaussians, $F_1, \ldots, F_k$, where the standard deviation of each distribution in any
direction is at most $\sigma_r$, and the weight of each distribution is $w_r$. We denote $\sigma_{\max} = \max_r\{\sigma_r\}$
and $w_{\min} = \min_r\{w_r\}$.

Theorem  9.15.  Suppose  we  are  given  a  set  ofn  3>  — —  samples  from  a  mixture  ofk  Gaus- 

^min  _ 

sians,  such  that  for  every  r  f  s  it  holds  that  II  ur  —  II  >  ccrmax  *  /  — - —  poly  log  (  — —  ). 

V  ^min  \  ^min  J 

Then  w.h.p.  these  points  satisfy  the  proximity  condition. 

For  Gaussians,  the  best  known  separation  bound  is  Achlioptas  and  McSherry’s  bound  [4] 
of  fl(amax(u;“11n/2  +  yfk  log(/c  ■  mirijn,  2k})  )).  As  we  assume  k  is  large,  this  separation 
condition  is  n((7max(m~,y  +  Vk))  =  Therefore,  the  separation  bound 

of  Theorem  9.15  is  \fk  times  worse  than  the  best  known  bound.  However,  applying  Ku¬ 
mar  and  Kannan’s  boosting  technique  (Section  7  in  [104]),  that  replaces  the  polynomial 
dependency  in  wmin  with  a  logarithmic  one,  we  get: 

Theorem 9.16. Suppose we are given a set of $n \gg \frac{d}{w_{\min}}$ samples from a mixture of $k$ Gaussians, such that for every $r \neq s$ it holds that

$$\|\mu_r - \mu_s\| \geq c\,\sigma_{\max}\sqrt{k}\;\mathrm{polylog}\left(\frac{d}{w_{\min}}\right).$$

Then there exists an algorithm that w.h.p. correctly classifies all points.

Therefore, if for any $r$ and $r'$, both $\sigma_r \approx \sigma_{r'}$ and $w_r \approx w_{r'}$, then both [4] and Theorem 9.16 give roughly the same bound. If for any $r$ and $r'$ we have that $\sigma_r \approx \sigma_{r'}$, yet $w_{\min} \ll 1/k$, then Theorem 9.16 provides a better bound. If for any $r$ and $r'$ we have that $w_r \approx w_{r'}$, yet the directional standard deviations of the distributions vary, then the bound of [4], in which the distance between any two cluster centers depends only on the parameters of these two distributions, is the better bound. If both the standard deviations and the weights vary significantly between the different distributions, then the better bound is determined on a case-by-case basis.


McSherry's Planted Partition Model. In the Planted Partition Model [110, 10, 11] our instance is a random $n$-vertex graph generated by using an implicit partition of the $n$ points into $k$ clusters. There exists an unknown $k \times k$ matrix of probabilities $P$, and for every pair of vertices $u, v$ there exists an edge connecting $u$ and $v$ w.p. $P_{rs}$ (assuming $u$ belongs to cluster $r$ and $v$ to cluster $s$). The goal here is to recover the partition of the points (and thus recover $P$). Viewing this graph as an $n \times n$ matrix, each row is taken from a special distribution $F_r$ over $\{0,1\}^n$, where each coordinate $j$ is an independent Bernoulli r.v. with mean $P_{r,c(j)}$, denoting $c(j)$ as the cluster $j$ belongs to. Thus, the mean $\mu_r$ of this distribution is a vector with its $j$-th coordinate set to $P_{r,c(j)}$. Denote $w_{\min} = \min_r\{n_r/n\}$ and $\sigma_{\max} = \max_{r,s}\sqrt{P_{rs}}$. The result of [110] is that if for every $r \neq s$

$$\|\mu_r - \mu_s\| = \Omega\left(\sigma_{\max}\sqrt{k}\left(\frac{1}{\sqrt{w_{\min}}} + \log(n/\delta)\right)\right) \tag{9.2}$$

then it is possible to retrieve the partition of the vertices w.p. at least $1 - \delta$.
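To make the model concrete, the following sketch samples an instance and computes the quantities that enter the separation condition. It is a minimal illustration under stated assumptions: the function names, the `sizes` argument (the cluster sizes $n_r$), and the choice to leave the diagonal at zero (no self-loops) are not taken from the text.

```python
import numpy as np

def sample_planted_partition(sizes, P, seed=0):
    """Sample a symmetric 0/1 adjacency matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    labels = np.repeat(np.arange(len(sizes)), sizes)  # c(j) for every vertex j
    probs = P[labels][:, labels]                      # probs[u, v] = P_{c(u), c(v)}
    # Draw each edge {u, v} with u < v independently, then symmetrize.
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    A = (upper | upper.T).astype(int)                 # assumption: no self-loops
    return A, labels

def model_parameters(sizes, P):
    """Return the mean vectors mu_r, w_min and sigma_max of the model."""
    P = np.asarray(P, dtype=float)
    sizes = np.asarray(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    mu = P[:, labels]                       # mu[r, j] = P_{r, c(j)}
    w_min = sizes.min() / sizes.sum()       # min_r n_r / n
    sigma_max = np.sqrt(P.max())            # max_{r,s} sqrt(P_rs)
    return mu, w_min, sigma_max
```

With `mu` in hand, `np.linalg.norm(mu[r] - mu[s])` gives the center distance that the separation condition constrains; the constants hidden in the $\Omega(\cdot)$ are not specified here.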

Kumar and Kannan were not able to match the distance bounds of McSherry, and required the distance between centers to be a $\sqrt{k}$ factor greater than the bound of (9.2). Here we match the bound of McSherry exactly. Following the proof in Kumar-Kannan (with a few changes), we prove:

Theorem 9.17. Assuming that $\sigma_{\max} \geq \frac{3\log(n)}{n}$ and that the planted partition model satisfies equation (9.2) for every $r \neq s$, then w.p. at least $1 - \delta$, every point satisfies the proximity condition.


Proof. We follow the proof of Kumar-Kannan, making the suitable changes. McSherry (Theorem 10 of [110]) showed that w.h.p. $\|A - C\| \leq 4\sigma_{\max}\sqrt{n}$. So our goal is to show that, w.h.p., all points are $\sqrt{k}$-good. I.e., denoting $u$ as a unit-length vector connecting $\mu_r$ and $\mu_s$, we show that w.h.p. for every $i \in T_r$ we have

$$|(A_i - \mu_r) \cdot u| = O\left(\sqrt{k}\,\sigma_{\max}\left(\frac{1}{\sqrt{w_{\min}}} + \log(n/\delta)\right)\right).$$


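Observe $u = \frac{\mu_r - \mu_s}{\|\mu_r - \mu_s\|}$, and due to the special structure of the means in this model, we have that $(\mu_r - \mu_s)_j = P_{rt} - P_{st}$ where $j \in T_t$. It follows that

$$\|\mu_r - \mu_s\|^2 = \sum_{t=1}^{k} n_t (P_{rt} - P_{st})^2.$$

We therefore have

$$|(A_i - \mu_r) \cdot u| \leq \frac{1}{\|\mu_r - \mu_s\|} \sum_{t=1}^{k} |P_{rt} - P_{st}| \left|\sum_{j \in T_t} (A_{ij} - P_{rt})\right|.$$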
Observe that the $A_{ij}$ (for $j \in T_t$) are i.i.d. 0-1 random variables with mean $P_{rt}$, so we expect their sum to deviate from its expectation by no more than a few standard deviations. Indeed, Kumar and Kannan prove that w.h.p. for every $t$ we have

$$\left|\sum_{j \in T_t} (A_{ij} - P_{rt})\right| \leq B\sqrt{n_t}\,\sigma_{\max}\left(\frac{1}{\sqrt{w_{\min}}} + \log(n/\delta)\right),$$

where $B$ is some sufficiently large constant. This allows us to deduce that

$$|(A_i - \mu_r) \cdot u| \leq B\sigma_{\max}\left(\frac{1}{\sqrt{w_{\min}}} + \log(n/\delta)\right) \frac{\sum_{t=1}^{k} \sqrt{n_t}\,|P_{rt} - P_{st}|}{\sqrt{\sum_{t=1}^{k} n_t (P_{rt} - P_{st})^2}} \leq B\sqrt{k}\,\sigma_{\max}\left(\frac{1}{\sqrt{w_{\min}}} + \log(n/\delta)\right),$$

where the last inequality is simply the power-mean inequality.


□ 
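The power-mean step that closes the proof can be spelled out as follows. The shorthand $a_t$ is introduced here for illustration only and does not appear in the original argument; the inequality used is the Cauchy-Schwarz form of the power-mean inequality applied to $k$ non-negative terms.

```latex
\[
a_t := \sqrt{n_t}\,|P_{rt} - P_{st}|, \qquad
\sum_{t=1}^{k} a_t
\;\le\; \sqrt{k}\,\Bigl(\sum_{t=1}^{k} a_t^{2}\Bigr)^{1/2}
= \sqrt{k}\,\Bigl(\sum_{t=1}^{k} n_t\,(P_{rt} - P_{st})^{2}\Bigr)^{1/2}
= \sqrt{k}\,\|\mu_r - \mu_s\| .
\]
```

Dividing both sides by $\|\mu_r - \mu_s\|$ shows that the ratio appearing in the bound is at most $\sqrt{k}$.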


9.6  An  Open  Problem 

Our  work  presents  an  algorithm  which  successfully  clusters  a  dataset,  provided  that  the 
distance  between  any  two  cluster  centers  meets  a  certain  lower  bound.  We  would  like  to 
point  out  one  particular  direction  to  improve  this  bound.  Note  that  our  center  separation 
bound depends on $\|A - C\|$, a property of the entire dataset. It would be nice to handle
the case where the separation condition between $\mu_r$ and $\mu_s$ depends solely on $C_r$ and $C_s$.
That is, if we define $\Delta_r = \frac{1}{\sqrt{n_r}}\|A_r - C_r\|$, is it possible to successfully separate clusters
s.t. $\|\mu_r - \mu_s\| \geq c(\Delta_r + \Delta_s)$? We comment that most of our analysis (and particularly
Lemma 9.11) builds only on the ratio between $\|\mu_r - \nu_r\|$ and $\|\mu_r - \mu_s\|$: we assume
the first is no greater than $a\Delta_r$ and that the latter is no less than $c(\Delta_r + \Delta_s)$. In fact,
one can revise the proofs of Theorems 9.5 and 9.6 so that they will hold based on this
assumption alone (without using the properties of the SVD). The problem therefore boils
down to finding initial centers $\{\nu_r\}$ that are sufficiently close to the true centers $\{\mu_r\}$,
under the assumption that $\forall r \neq s$, $\|\mu_r - \mu_s\| \geq c(\Delta_r + \Delta_s)$. But this is an intricate
task, mainly because such a separation condition does not imply that $\{\mu_1, \mu_2, \ldots, \mu_k\}$ are
the centers minimizing the $k$-means cost! (Nor do $\{\mu_1, \mu_2, \ldots, \mu_k\}$ minimize the
$k$-means cost of $A$.) Consider the case, for example, where cluster $r$ has very few points
(say $n_r = \sqrt{n}$) and very small variance, and cluster $s$ is very big (say $n_s = n/5$), and is
essentially composed of two sub-components with distance on the order of $\frac{1}{\sqrt{n_s}}\|A_s - C_s\|$ between the
centers of the two sub-components. The $k$-means cost of placing two centers within $C_s$ is
smaller than that of placing one center at $\mu_s$ and one center at $\mu_r$. This relates to the question of
designing a $t$-approximation algorithm for $k$-means, guaranteeing that each cluster's cost
cannot increase by more than a factor of $t$.
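The phenomenon described above can be checked numerically on a toy one-dimensional instance. The numbers below (`m`, `q`, `L`, `D`) are hypothetical and chosen only for illustration: cluster $s$ consists of two tight sub-components at $\pm L$, and the tiny cluster $r$ sits at distance $D$ from $\mu_s$.

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum of squared distances from each 1-D point to its nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return float(((points[:, None] - centers[None, :]) ** 2).min(axis=1).sum())

# Hypothetical instance: cluster s has two sub-components, cluster r is tiny.
m, q, L, D = 1000, 10, 1.0, 10.0
cluster_s = np.concatenate([np.full(m, -L), np.full(m, L)])
cluster_r = np.full(q, D)
points = np.concatenate([cluster_s, cluster_r])

# Option 1: the true centers, mu_s = 0 and mu_r = D.
cost_true = kmeans_cost(points, [cluster_s.mean(), cluster_r.mean()])
# Option 2: both centers placed inside cluster s, one per sub-component.
cost_split = kmeans_cost(points, [-L, L])
```

Here the true centers pay $2mL^2$ for cluster $s$, while splitting cluster $s$ pays only $q(D-L)^2$ for the points of cluster $r$, so the split is cheaper whenever $q(D-L)^2 < 2mL^2$.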